Preface


These notes grew from an introduction to probability theory taught during 
the first and second term of 1994 at Caltech. There was a mixed audience of 
undergraduates and graduate students in the first half of the course which 
covered Chapters 2 and 3, and mostly graduate students in the second part 
which covered Chapter 4 and two sections of Chapter 5. 


After being online for many years on my personal web sites, the text was
reviewed, corrected and indexed in the summer of 2006. It also received some
enhancements which benefited from other teaching notes and research that
I wrote while teaching probability theory at the University of Arizona in
Tucson or when incorporating probability in calculus courses at Caltech
and Harvard University.


Most of Chapter 2 is standard material and the subject of virtually any
course on probability theory. Chapters 3 and 4 are also well covered by the
literature, but not in this combination.


The last chapter "selected topics" got considerably extended in the summer
of 2006. While in the original course, only localization and percolation prob- 
lems were included, I added other topics like estimation theory, Vlasov dy- 
namics, multi-dimensional moment problems, random maps, circle-valued 
random variables, the geometry of numbers, Diophantine equations and 
harmonic analysis. Some of this material is related to research I got inter- 
ested in over time. 


While the text assumes no prerequisites in probability, a basic exposure to 
calculus and linear algebra is necessary. Some real analysis as well as some 
background in topology and functional analysis can be helpful. 


I would like to get feedback from readers. I plan to keep this text alive and
update it in the future. You can email feedback to knill@math.harvard.edu;
please also indicate in the email if you don’t want your feedback to be
acknowledged in an eventual future edition of these notes.




To get a more detailed and analytic exposure to probability, the students
of the original course consulted the book [105], which contains much
more material than was covered in class. Since my course was taught,
many other books have appeared. Examples are [21, 34].


For a less analytic approach, see [40, 91, 97] or the still excellent classic 
[26]. For an introduction to martingales, we recommend [108] and [47] from 
both of which these notes have benefited a lot and to which the students 
of the original course had access too. 


For Brownian motion, we refer to [73, 66], for stochastic processes to [17],
for stochastic differential equations to [2, 55, 76, 66, 46], for random walks
to [100], for Markov chains to [27, 87], and for entropy and Markov operators
to [61]. For applications in physics and chemistry, see [106].


For the selected topics, we followed [32] in the percolation section. The
books [101, 30] contain introductions to Vlasov dynamics. The book [1]
gives an introduction to the moment problem and [75, 64] to circle-valued
random variables; for Poisson processes, see [49, 9]. For the geometry of
numbers and for Fourier series on fractals, see [45].


The book [109] contains examples which challenge the theory with
counterexamples. [33, 92, 70] are sources for problems with solutions.


Probability theory can be developed using nonstandard analysis on finite
probability spaces [74]. The book [42] breaks some of the material of the
first chapter into attractive stories. Also texts like [89, 78] are not only for
mathematical tourists.


We live in a time, in which more and more content is available online. 
Knowledge diffuses from papers and books to online websites and databases 
which also ease the digging for knowledge in the fascinating field of proba- 
bility theory. 


Oliver Knill 


Chapter 1 


Introduction 


1.1 What is probability theory? 


Probability theory is a fundamental pillar of modern mathematics with
relations to other mathematical areas like algebra, topology, analysis,
geometry or dynamical systems. As with any fundamental mathematical
construction, the theory starts by adding more structure to a set $\Omega$.
In a similar way as introducing algebraic operations, a topology, or a time
evolution on a set, probability theory adds a measure theoretical structure
to $\Omega$ which generalizes "counting" on finite sets: in order to measure
the probability of a subset $A \subset \Omega$, one singles out a class of
subsets $\mathcal{A}$, on which one can hope to do so. This leads to the
notion of a $\sigma$-algebra $\mathcal{A}$. It is a set of subsets of
$\Omega$ in which one can perform finitely or countably many operations
like taking unions, complements or intersections. The elements in
$\mathcal{A}$ are called events. If a point $\omega$ in the "laboratory"
$\Omega$ denotes an "experiment", an "event" $A \in \mathcal{A}$ is a subset
of $\Omega$, for which one can assign a probability $P[A] \in [0,1]$. For
example, if $P[A] = 1/3$, the event happens with probability 1/3. If
$P[A] = 1$, the event takes place almost certainly. The probability measure
$P$ has to satisfy obvious properties like that the union $A \cup B$ of two
disjoint events $A, B$ satisfies $P[A \cup B] = P[A] + P[B]$ or that the
complement $A^c$ of an event $A$ has the probability $P[A^c] = 1 - P[A]$.
With a probability space $(\Omega, \mathcal{A}, P)$ alone, there is already
some interesting mathematics: one has for example the combinatorial problem
to find the probabilities of events like the event to get a "royal flush"
in poker. If $\Omega$ is a subset of a Euclidean space like the plane and
$P[A] = \int_A f(x,y) \, dx \, dy$ for a suitable nonnegative function $f$,
we are led to integration problems in calculus. Actually, in many
applications, the probability space is part of Euclidean space and the
$\sigma$-algebra is the smallest which contains all open sets. It is called
the Borel $\sigma$-algebra. An important example is the Borel
$\sigma$-algebra on the real line.
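
As a small illustration of such a combinatorial problem, the following
Python sketch computes the probability of a royal flush. It assumes a
standard 52-card deck and a 5-card hand in which every hand is equally
likely; these modelling assumptions and the variable names are ours and
only serve as an example.

```python
from fractions import Fraction
from math import comb

# Assumption: Omega is the set of all 5-card hands from a 52-card deck,
# each hand being equally likely (normalized counting measure).
number_of_hands = comb(52, 5)

# A royal flush is 10, J, Q, K, A of one suit: exactly one hand per suit.
number_of_royal_flushes = 4

p_royal_flush = Fraction(number_of_royal_flushes, number_of_hands)
print(p_royal_flush)         # exact probability as a fraction
print(float(p_royal_flush))  # decimal approximation
```

Working with exact fractions avoids rounding questions for such tiny
probabilities.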


Given a probability space $(\Omega, \mathcal{A}, P)$, one can define random
variables $X$. A random variable is a function $X$ from $\Omega$ to the real
line $\mathbb{R}$ which is measurable in the sense that the inverse image of
a Borel set $B$ in $\mathbb{R}$ is in $\mathcal{A}$. The interpretation is
that if $\omega$ is an experiment, then $X(\omega)$ measures an observable
quantity of the experiment. The technical condition of measurability
resembles the notion of continuity for a function $f$ from a topological
space $(\Omega, \mathcal{O})$ to the topological space
$(\mathbb{R}, \mathcal{U})$. A function is continuous if
$f^{-1}(U) \in \mathcal{O}$ for all open sets $U \in \mathcal{U}$. In
probability theory, where functions are often denoted with capital letters,
like $X, Y, \dots$, a random variable $X$ is measurable if
$X^{-1}(B) \in \mathcal{A}$ for all Borel sets $B \in \mathcal{B}$. Any
continuous function is measurable for the Borel $\sigma$-algebra. As in
calculus, where one does not have to worry about continuity most of the
time, also in probability theory, one often does not have to sweat about
measurability issues. Indeed, one could suspect that notions like
$\sigma$-algebras or measurability were introduced by mathematicians to
scare normal folks away from their realms. This is not the case. Serious
issues are avoided with those constructions. Mathematics is eternal: a once
established result will be true also in thousands of years. A theory in
which one could prove a theorem as well as its negation would be worthless:
it would formally allow one to prove any other result, whether true or
false. So, these notions are not only introduced to keep the theory
"clean", they are essential for the "survival" of the theory. We give some
examples of "paradoxes" to illustrate the need for building a careful
theory. Back to the fundamental notion of random variables: because they
are just functions, one can add and multiply them by defining
$(X + Y)(\omega) = X(\omega) + Y(\omega)$ or
$(XY)(\omega) = X(\omega) Y(\omega)$. Random variables thus form an algebra
$\mathcal{L}$. The expectation of a random variable $X$ is denoted by
$E[X]$ if it exists. It is a real number which indicates the "mean" or
"average" of the observation $X$. It is the value one would expect to
measure in the experiment. If $X = 1_B$ is the random variable which has
the value 1 if $\omega$ is in the event $B$ and 0 if $\omega$ is not in the
event $B$, then the expectation of $X$ is just the probability of $B$. The
constant random variable $X(\omega) = a$ has the expectation $E[X] = a$.
These two basic examples as well as the linearity requirement
$E[aX + bY] = aE[X] + bE[Y]$ determine the expectation for all random
variables in the algebra $\mathcal{L}$: first one defines expectation for
finite sums $\sum_{i=1}^n a_i 1_{B_i}$, called elementary random variables,
which approximate general measurable functions. Extending the expectation
to a subset $\mathcal{L}^1$ of the entire algebra is part of integration
theory. While in calculus, one can live with the Riemann integral on the
real line, which defines the integral by Riemann sums
$\int_a^b f(x) \, dx \sim \frac{1}{n} \sum_{i/n \in [a,b]} f(i/n)$, the
integral defined in measure theory is the Lebesgue integral. The latter is
more fundamental and probability theory is a major motivator for using it.
It allows one to make statements like that the set of real numbers with
periodic decimal expansion has probability 0. In general, the probability
of $A$ is the expectation of the random variable $X(x) = f(x) = 1_A(x)$. In
calculus, the integral $\int_0^1 f(x) \, dx$ would not be defined because a
Riemann integral can give 1 or 0 depending on how the Riemann approximation
is done. Probability theory allows one to introduce the Lebesgue integral
by defining $\int_a^b f(x) \, dx$ as the limit of
$\frac{b-a}{n} \sum_{i=1}^n f(x_i)$ for $n \to \infty$, where $x_i$ are
random uniformly distributed points in the interval $[a,b]$. This Monte
Carlo definition of the Lebesgue integral is based on the law of large
numbers and is as intuitive to state as the Riemann integral which is the
limit of $\frac{1}{n} \sum_{x_j = j/n \in [a,b]} f(x_j)$ for
$n \to \infty$.
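
To make the comparison of the two definitions concrete, here is a minimal
Python sketch which approximates an integral over $[0,1]$ once by a Riemann
sum and once by a Monte Carlo average. The test function $f(x) = x^2$, the
sample sizes and the fixed seed are arbitrary choices made for this
illustration. On a general interval $[a,b]$ the Monte Carlo average would
additionally be multiplied by the length $b - a$.

```python
import random

def riemann_sum(f, n):
    """Riemann sum (1/n) * sum_{j=1}^{n} f(j/n) on the interval [0, 1]."""
    return sum(f(j / n) for j in range(1, n + 1)) / n

def monte_carlo(f, n, rng):
    """Average of f over n independent uniform random points in [0, 1]."""
    return sum(f(rng.random()) for _ in range(n)) / n

def f(x):
    return x * x              # the exact integral over [0, 1] is 1/3

rng = random.Random(0)        # fixed seed, an arbitrary choice
riemann_estimate = riemann_sum(f, 1000)
mc_estimate = monte_carlo(f, 100_000, rng)
print(riemann_estimate, mc_estimate)
```

Both numbers approach 1/3. The Monte Carlo error decays only like
$1/\sqrt{n}$, but the method needs no regularity of $f$ beyond
integrability.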

With the fundamental notion of expectation one can define the variance,
$\mathrm{Var}[X] = E[X^2] - E[X]^2$ and the standard deviation
$\sigma[X] = \sqrt{\mathrm{Var}[X]}$ of a random variable $X$ for which
$X^2 \in \mathcal{L}^1$. One can also look at the covariance
$\mathrm{Cov}[X,Y] = E[XY] - E[X]E[Y]$ of two random variables $X, Y$ for
which $X^2, Y^2 \in \mathcal{L}^1$. The correlation
$\mathrm{Corr}[X,Y] = \mathrm{Cov}[X,Y]/(\sigma[X]\sigma[Y])$ of two
random variables with positive variance is a number which tells how much
the random variable $X$ is related to the random variable $Y$. If $E[XY]$
is interpreted as an inner product, then the standard deviation is the
length of $X - E[X]$ and the correlation has the geometric interpretation
as $\cos(\alpha)$, where $\alpha$ is the angle between the centered random
variables $X - E[X]$ and $Y - E[Y]$. For example, if
$\mathrm{Corr}[X,Y] = 1$, then $Y - E[Y] = \lambda (X - E[X])$ for some
$\lambda > 0$; if $\mathrm{Corr}[X,Y] = -1$, they are anti-parallel. If the
correlation is zero, the geometric interpretation is that the two random
variables are perpendicular. Decorrelated random variables still can have
relations to each other but if for any measurable real functions $f$ and
$g$, the random variables $f(X)$ and $g(Y)$ are uncorrelated, then the
random variables $X, Y$ are independent.
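
The geometric interpretation can be checked numerically. The following
sketch, written for a finite probability space with equally likely points,
computes the correlation once from the covariance formula and once as the
cosine of the angle between the centered variables. The data vectors `X`
and `Y` and all function names are hypothetical and chosen only for
illustration.

```python
import math

# Hypothetical data: values of X and Y on a uniform 5-point probability space.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.0, 1.0, 4.0, 3.0, 6.0]

def E(Z):
    """Expectation with respect to the uniform measure."""
    return sum(Z) / len(Z)

def cov(U, V):
    """Cov[U, V] = E[UV] - E[U] E[V]."""
    return E([u * v for u, v in zip(U, V)]) - E(U) * E(V)

def corr(U, V):
    """Corr[U, V] = Cov[U, V] / (sigma[U] sigma[V])."""
    return cov(U, V) / math.sqrt(cov(U, U) * cov(V, V))

def cos_angle(U, V):
    """Cosine of the angle between U - E[U] and V - E[V] for the
    inner product <A, B> = E[AB]."""
    cu = [u - E(U) for u in U]
    cv = [v - E(V) for v in V]
    inner = E([a * b for a, b in zip(cu, cv)])
    length_u = math.sqrt(E([a * a for a in cu]))
    length_v = math.sqrt(E([b * b for b in cv]))
    return inner / (length_u * length_v)

print(corr(X, Y), cos_angle(X, Y))       # the two numbers agree
print(corr(X, [2 * x + 1 for x in X]))   # affine, positive slope: 1
print(corr(X, [-x for x in X]))          # anti-parallel: -1
```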


A random variable $X$ can be described well by its distribution function
$F_X$. This is a real-valued function defined as $F_X(s) = P[X \le s]$ on
$\mathbb{R}$, where $\{X \le s\}$ is the event of all experiments $\omega$
satisfying $X(\omega) \le s$. The distribution function does not encode the
internal structure of the random variable $X$; it does not reveal the
structure of the probability space for example. But the function $F_X$
allows the construction of a probability space with exactly this
distribution function. There are two important types of distributions,
continuous distributions with a probability density function $f_X = F_X'$
and discrete distributions for which $F$ is piecewise constant. An example
of a continuous distribution is the standard normal distribution, where
$f_X(x) = e^{-x^2/2}/\sqrt{2\pi}$. One can characterize it as the
distribution with maximal entropy $J(f) = -\int \log(f(x)) f(x) \, dx$
among all distributions which have zero mean and variance 1. An example of
a discrete distribution is the Poisson distribution
$P[X = k] = e^{-\lambda} \frac{\lambda^k}{k!}$ on
$\mathbb{N} = \{0, 1, 2, \dots\}$. One can describe random variables by
their moment generating functions $M_X(t) = E[e^{tX}]$ or by their
characteristic function $\phi_X(t) = E[e^{itX}]$. The latter is the Fourier
transform of the law $\mu_X = F_X'$ which is a measure on the real line
$\mathbb{R}$.


The law $\mu_X$ of the random variable is a probability measure on the real
line satisfying $\mu_X((a,b]) = F_X(b) - F_X(a)$. By the Lebesgue
decomposition theorem, one can decompose any measure $\mu$ into a discrete
part $\mu_{pp}$, an absolutely continuous part $\mu_{ac}$ and a singular
continuous part $\mu_{sc}$. Random variables $X$ for which $\mu_X$ is a
discrete measure are called discrete random variables, random variables
with a continuous law are called continuous random variables.
Traditionally, these two types of random variables are the most important
ones. But singular continuous random variables appear too: in spectral
theory, dynamical systems or fractal geometry. Of course, the law of a
random variable $X$ does not need to be pure. It can mix the three types. A
random variable can be mixed discrete and continuous for example.


Inequalities play an important role in probability theory. The Chebychev
inequality $P[|X - E[X]| \ge c] \le \frac{\mathrm{Var}[X]}{c^2}$ is used
very often. It is a special case of the Chebychev-Markov inequality
$h(c) \cdot P[X \ge c] \le E[h(X)]$ for monotone nonnegative functions $h$.
Other inequalities are the Jensen inequality $E[h(X)] \ge h(E[X])$ for
convex functions $h$, the Minkowski inequality
$\|X + Y\|_p \le \|X\|_p + \|Y\|_p$ or the Hölder inequality
$\|XY\|_1 \le \|X\|_p \|Y\|_q$, $1/p + 1/q = 1$ for random variables
$X, Y$, for which $\|X\|_p^p = E[|X|^p]$, $\|Y\|_q^q = E[|Y|^q]$ are
finite. Any inequality which appears in analysis can be useful in the
toolbox of probability theory.
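
As a quick sanity check of the Chebychev inequality, the following sketch
compares the exact tail probability $P[|X - E[X]| \ge c]$ with the bound
$\mathrm{Var}[X]/c^2$ when $X$ is the outcome of a fair die. The die
example and the chosen values of $c$ are our own illustration.

```python
from fractions import Fraction

faces = range(1, 7)          # a fair die, chosen only as an example
p = Fraction(1, 6)           # probability of each face

mean = sum(p * k for k in faces)
variance = sum(p * (k - mean) ** 2 for k in faces)

def tail_probability(c):
    """Exact value of P[|X - E[X]| >= c]."""
    return sum(p for k in faces if abs(k - mean) >= c)

def chebyshev_bound(c):
    """Upper bound Var[X] / c^2 from the Chebychev inequality."""
    return variance / c ** 2

for c in (Fraction(3, 2), Fraction(2), Fraction(5, 2)):
    print(c, tail_probability(c), chebyshev_bound(c))
```

The bound is far from sharp here, which is typical: the inequality holds
for every distribution with finite variance and uses nothing else.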


Independence is a central notion in probability theory. Two events $A, B$
are called independent, if $P[A \cap B] = P[A] \cdot P[B]$. An arbitrary
set of events $A_i$ is called independent, if for any finite subset of
them, the probability of their intersection is the product of their
probabilities. Two $\sigma$-algebras $\mathcal{A}, \mathcal{B}$ are called
independent, if for any pair $A \in \mathcal{A}, B \in \mathcal{B}$, the
events $A, B$ are independent. Two random variables $X, Y$ are independent,
if they generate independent $\sigma$-algebras. It is enough to check that
the events $A = \{X \in (a,b)\}$ and $B = \{Y \in (c,d)\}$ are independent
for all intervals $(a,b)$ and $(c,d)$. One should think of independent
random variables as two aspects of the laboratory $\Omega$ which do not
influence each other. Each event $A = \{a < X(\omega) < b\}$ is independent
of the event $B = \{c < Y(\omega) < d\}$. While the distribution function
$F_{X+Y}$ of the sum of two independent random variables is a convolution
$\int_{\mathbb{R}} F_X(t-s) \, dF_Y(s)$, the moment generating functions
and characteristic functions satisfy the formulas
$M_{X+Y}(t) = M_X(t) M_Y(t)$ and $\phi_{X+Y}(t) = \phi_X(t) \phi_Y(t)$.
These identities make $M_X, \phi_X$ valuable tools to compute the
distribution of an arbitrary finite sum of independent random variables.
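
For discrete random variables the convolution becomes a finite sum. The
sketch below computes the law of the sum of two independent fair dice by
convolution and checks numerically that the moment generating function of
the sum is the product of the individual ones. The dice, the value
$t = 0.3$ and the helper names are assumptions made for this example.

```python
from fractions import Fraction
import math

# Law of a fair die as a dictionary: value -> probability.
die = {k: Fraction(1, 6) for k in range(1, 7)}

def convolve(p, q):
    """Law of X + Y for independent X with law p and Y with law q."""
    r = {}
    for a, pa in p.items():
        for b, qb in q.items():
            r[a + b] = r.get(a + b, Fraction(0)) + pa * qb
    return r

def mgf(p, t):
    """Moment generating function M(t) = E[exp(t X)] of a finite law p."""
    return sum(float(w) * math.exp(t * k) for k, w in p.items())

total = convolve(die, die)
print(total[7])                                # most likely sum of two dice
print(mgf(total, 0.3), mgf(die, 0.3) ** 2)     # the two values agree
```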


Independence can also be explained using conditional probability with
respect to an event $B$ of positive probability: the conditional
probability $P[A|B] = P[A \cap B]/P[B]$ of $A$ is the probability that $A$
happens when we know that $B$ takes place. If $B$ is independent of $A$,
then $P[A|B] = P[A]$ but in general, the conditional probability differs
from $P[A]$. The notion of conditional probability leads to the important
notion of conditional expectation $E[X|\mathcal{B}]$ of a random variable
$X$ with respect to some sub-$\sigma$-algebra $\mathcal{B}$ of the
$\sigma$-algebra $\mathcal{A}$; it is a new random variable which is
$\mathcal{B}$-measurable. For $\mathcal{B} = \mathcal{A}$, it is the random
variable itself, for the trivial algebra
$\mathcal{B} = \{\emptyset, \Omega\}$, we obtain the usual expectation
$E[X] = E[X|\{\emptyset, \Omega\}]$. If $\mathcal{B}$ is generated by a
finite partition $B_1, \dots, B_n$ of $\Omega$ of pairwise disjoint sets
covering $\Omega$, then $E[X|\mathcal{B}]$ is piecewise constant on the
sets $B_i$ and the value on $B_i$ is the average value of $X$ on $B_i$. If
$\mathcal{B}$ is the $\sigma$-algebra of an independent random variable
$Y$, then $E[X|Y] = E[X|\mathcal{B}] = E[X]$. In general, the conditional
expectation with respect to $\mathcal{B}$ is a new random variable obtained
by averaging on the elements of $\mathcal{B}$. One has $E[X|Y] = h(Y)$ for
some function $h$, extreme cases being $E[X|1] = E[X]$, $E[X|X] = X$. An
illustrative example is the situation where $X(x,y)$ is a continuous
function on the unit square with $P = dx \, dy$ as a probability measure
and where $Y(x,y) = x$. In that case, $E[X|Y]$ is a function of $x$ alone,
given by $E[X|Y](x) = \int_0^1 X(x,y) \, dy$. This is called a conditional
integral.
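
The case of a finite partition is easy to compute by hand or by machine.
The following sketch takes a fair die as probability space, conditions the
outcome $X$ on the partition into odd and even faces, and verifies that
averaging the conditional expectation gives back $E[X]$. The partition and
all names are hypothetical choices for this illustration.

```python
from fractions import Fraction

omega = [1, 2, 3, 4, 5, 6]                  # faces of a fair die
P = {w: Fraction(1, 6) for w in omega}      # uniform probability measure
X = {w: Fraction(w) for w in omega}         # random variable X(w) = w

# Hypothetical partition generating the sub-sigma-algebra B: odd / even.
partition = [{1, 3, 5}, {2, 4, 6}]

def conditional_expectation(X, P, partition):
    """E[X|B]: on each block, the P-weighted average of X on that block."""
    result = {}
    for block in partition:
        mass = sum(P[w] for w in block)
        average = sum(X[w] * P[w] for w in block) / mass
        for w in block:
            result[w] = average
    return result

def expectation(Z, P):
    """E[Z] for a random variable Z on the finite space."""
    return sum(Z[w] * P[w] for w in Z)

E_X_given_B = conditional_expectation(X, P, partition)
print(E_X_given_B)                                   # 3 on odd, 4 on even faces
print(expectation(E_X_given_B, P), expectation(X, P))
```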


A set $\{X_t\}_{t \in T}$ of random variables defines a stochastic process.
The variable $t \in T$ is a parameter called "time". Stochastic processes
are to probability theory what differential equations are to calculus. An
example is a family $X_n$ of random variables which evolve with discrete
time $n \in \mathbb{N}$. Deterministic dynamical system theory branches
into discrete time systems, the iteration of maps and continuous time
systems, the theory of ordinary and partial differential equations.
Similarly, in probability theory, one distinguishes between discrete time
stochastic processes and continuous time stochastic processes. A discrete
time stochastic process is a sequence of random variables $X_n$ with
certain properties. An important example is when $X_n$ are independent,
identically distributed random variables. A continuous time stochastic
process is given by a family of random variables $X_t$, where $t$ is real
time. An example is a solution of a stochastic differential equation. With
more general time like $\mathbb{Z}^d$ or $\mathbb{R}^d$ random variables
are called random fields which play a role in statistical physics. Examples
of such processes are percolation processes.


While one can realize every discrete time stochastic process $X_n$ by a
measure-preserving transformation $T : \Omega \to \Omega$ and
$X_n(\omega) = X(T^n(\omega))$, probability theory often focuses on a
special subclass of systems called martingales, where one has a filtration
$\mathcal{A}_n \subset \mathcal{A}_{n+1}$ of $\sigma$-algebras such that
$X_n$ is $\mathcal{A}_n$-measurable and
$E[X_n|\mathcal{A}_{n-1}] = X_{n-1}$, where $E[X_n|\mathcal{A}_{n-1}]$ is
the conditional expectation with respect to the sub-algebra
$\mathcal{A}_{n-1}$. Martingales are a powerful generalization of the
random walk, the process of summing up IID random variables with zero mean.
Similarly to ergodic theory, martingale theory is a natural extension of
probability theory and has many applications.


The language of probability fits well into the classical theory of
dynamical systems. For example, the ergodic theorem of Birkhoff for
measure-preserving transformations has as a special case the law of large
numbers which describes the average of partial sums of random variables
$\frac{1}{n} \sum_{k=1}^n X_k$. There are different versions of the law of
large numbers. "Weak laws" make statements about convergence in
probability, "strong laws" make statements about almost everywhere
convergence. There are versions of the law of large numbers for which the
random variables do not need to have a common distribution and which go
beyond Birkhoff’s theorem. Another important theorem is the central limit
theorem which shows that $S_n = X_1 + X_2 + \dots + X_n$ normalized to have
zero mean and variance 1 converges in law to the normal distribution or the
law of the iterated logarithm which says that for centered independent and
identically distributed $X_k$, the scaled sum $S_n/\Lambda_n$ has
accumulation points in the interval $[-\sigma, \sigma]$ if
$\Lambda_n = \sqrt{2n \log\log n}$ and $\sigma$ is the standard deviation
of $X_k$. While stating the weak and strong law of large numbers and the
central limit theorem, different convergence notions for random variables
appear: almost sure convergence is the strongest, it implies convergence in
probability and the latter implies convergence in law. There is also
$L^1$-convergence which is stronger than convergence in probability.
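
Both limit theorems are easy to observe in a simulation. The sketch below
uses IID random variables which are uniform on $[0,1]$ (mean 1/2, variance
1/12); this distribution, the sample sizes and the fixed seed are
assumptions made for the illustration. It prints a sample mean, which
should be close to 1/2, and the fraction of normalized sums falling in
$[-1,1]$, which should be close to the value 0.6827 of the standard normal
distribution.

```python
import math
import random

rng = random.Random(1)   # fixed seed (arbitrary) for reproducibility

def sample_mean(n):
    """Average of n IID uniform random variables on [0, 1]."""
    return sum(rng.random() for _ in range(n)) / n

def normalized_sum(n):
    """S_n centered and scaled so that it has mean 0 and variance 1."""
    s = sum(rng.random() for _ in range(n))
    return (s - n * 0.5) / math.sqrt(n / 12.0)

# Law of large numbers: the sample mean is close to E[X] = 1/2.
lln_estimate = sample_mean(100_000)

# Central limit theorem: about 68.27 percent of the normalized sums
# fall within one standard deviation of 0.
trials = [normalized_sum(100) for _ in range(2000)]
fraction_within_one_sigma = sum(abs(z) <= 1 for z in trials) / len(trials)

print(lln_estimate, fraction_within_one_sigma)
```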


As in the deterministic case, where the theory of differential equations is
more technical than the theory of maps, building up the formalism for
continuous time stochastic processes $X_t$ is more elaborate. Similarly as
for differential equations, one has first to prove the existence of the
objects. The most important continuous time stochastic process definitely
is Brownian motion $B_t$. Standard Brownian motion is a stochastic process
which satisfies $B_0 = 0$, $E[B_t] = 0$, $\mathrm{Cov}[B_s, B_t] = s$ for
$s \le t$ and for any sequence of times,
$0 = t_0 < t_1 < \dots < t_i < t_{i+1}$, the increments
$B_{t_{i+1}} - B_{t_i}$ are all independent random vectors with normal
distribution. Brownian motion $B_t$ is a solution of the stochastic
differential equation $\frac{d}{dt} B_t = \zeta(t)$, where $\zeta(t)$ is
called white noise. Because white noise is only defined as a generalized
function and is not a stochastic process by itself, this stochastic
differential equation has to be understood in its integrated form
$B_t = \int_0^t dB_s = \int_0^t \zeta(s) \, ds$.


More generally, a solution to a stochastic differential equation
$\frac{d}{dt} X_t = f(X_t) \zeta(t) + g(X_t)$ is defined as the solution to
the integral equation
$X_t = X_0 + \int_0^t f(X_s) \, dB_s + \int_0^t g(X_s) \, ds$. Stochastic
differential equations can be defined in different ways. The expression
$\int_0^t f(X_s) \, dB_s$ can either be defined as an Ito integral, which
leads to martingale solutions, or the Stratonovich integral, which has
similar integration rules as classical differential equations. Examples of
stochastic differential equations are $\frac{d}{dt} X_t = X_t \zeta(t)$
which has the solution $X_t = e^{B_t - t/2}$. Or
$\frac{d}{dt} X_t = B_t^4 \zeta(t)$ which has as the solution the process
$X_t = B_t^5 - 10 B_t^3 + 15 B_t$. The key tool to solve stochastic
differential equations is Ito’s formula
$f(B_t) - f(B_0) = \int_0^t f'(B_s) \, dB_s + \frac{1}{2} \int_0^t f''(B_s) \, ds$,
which is the stochastic analog of the fundamental theorem of calculus.
Solutions to stochastic differential equations are examples of Markov
processes which show diffusion. In particular, the solutions can be used to
solve classical partial differential equations like the Dirichlet problem
$\Delta u = 0$ in a bounded domain $D$ with $u = f$ on the boundary
$\delta D$. One can get the solution by computing the expectation of $f$ at
the end points of Brownian motion starting at $x$ and ending at the
boundary: $u(x) = E_x[f(B_T)]$. On a discrete graph, if Brownian motion is
replaced by random walk, the same formula holds too. Stochastic calculus is
also useful to interpret quantum mechanics as a diffusion process [73, 71]
or as a tool to compute solutions to quantum mechanical problems using
Feynman-Kac formulas.
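
The discrete version of this formula can be tried out directly. The
following sketch solves the Dirichlet problem on the path graph
$\{0, 1, \dots, n\}$ with boundary $\{0, n\}$ by averaging the boundary
values over many simple random walks. The graph, the boundary data
$f(0) = 0$, $f(n) = 1$, the number of trials and the seed are assumptions
made for this example; for this choice the exact harmonic function is
$u(x) = x/n$.

```python
import random

def exit_value(start, n, f, rng):
    """Run a simple random walk on {0, ..., n} from `start` until it hits
    the boundary {0, n} and return f at the exit point."""
    x = start
    while 0 < x < n:
        x += rng.choice((-1, 1))
    return f(x)

def harmonic_estimate(start, n, f, trials, rng):
    """Monte Carlo estimate of u(start) = E_start[f(walk at exit time)]."""
    return sum(exit_value(start, n, f, rng) for _ in range(trials)) / trials

rng = random.Random(2)      # fixed seed, an arbitrary choice
n = 10

def f(x):
    """Boundary values: f(0) = 0 and f(n) = 1."""
    return 1.0 if x == n else 0.0

u3 = harmonic_estimate(3, n, f, 20_000, rng)
print(u3)   # close to the exact value 3/10
```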


Some features of stochastic processes can be described using the language
of Markov operators $P$, which are positive and expectation-preserving
transformations on $\mathcal{L}^1$. Examples of such operators are
Perron-Frobenius operators $X \mapsto X(T)$ for a measure preserving
transformation $T$ defining a discrete time evolution or stochastic
matrices describing a random walk on a finite graph. Markov operators can
be defined by transition probability functions which are measure-valued
random variables. The interpretation is that from a given point $\omega$,
there are different possibilities to go to. A transition probability
measure $P(\omega, \cdot)$ gives the distribution of the target. The
relation with Markov operators is assured by the Chapman-Kolmogorov
equation $P^{n+m} = P^n \circ P^m$. Markov processes can be obtained from
random transformations, random walks or by stochastic differential
equations. In the case of a finite or countable target space $S$, one
obtains Markov chains which can be described by probability matrices $P$,
which are the simplest Markov operators. For Markov operators, there is an
arrow of time: the relative entropy with respect to a background measure is
non-increasing. Markov processes often are attracted by fixed points of the
Markov operator. Such fixed points are called stationary states. They
describe equilibria and often they are measures with maximal entropy. An
example is the Markov operator $P$, which assigns to a probability density
$f_Y$ the probability density $f_{\overline{Y+X}}$ where $\overline{Y+X}$
is the random variable $Y + X$ normalized so that it has mean 0 and
variance 1. For the initial function $f = 1$, the function $P^n(f_X)$ is
the distribution of $S_n^*$, the normalized sum of $n$ IID random variables
$X_i$. This Markov operator has a unique equilibrium point, the standard
normal distribution. It has maximal entropy among all distributions on the
real line with variance 1 and mean 0. The central limit theorem tells that
the Markov operator $P$ has the normal distribution as a unique attracting
fixed point if one takes the weaker topology of convergence in distribution
on $\mathcal{L}^1$. This works in other situations too. For circle-valued
random variables for example, the uniform distribution maximizes entropy.
It is not surprising therefore, that there is a central limit theorem for
circle-valued random variables with the uniform distribution as the
limiting distribution.
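
In the simplest case of a finite Markov chain, the attraction towards a
stationary state can be watched directly. The sketch below iterates a
two-state stochastic matrix on an initial distribution. The matrix entries
and the number of iterations are made up for this example, and the
convention used is that rows sum to 1 and distributions are row vectors.

```python
def step(distribution, P):
    """One application of the row-stochastic matrix P: pi -> pi P."""
    n = len(P)
    return [sum(distribution[i] * P[i][j] for i in range(n)) for j in range(n)]

# Hypothetical two-state chain; each row sums to 1.
P = [[0.9, 0.1],
     [0.5, 0.5]]

pi = [1.0, 0.0]              # start in state 0 with probability 1
for _ in range(200):
    pi = step(pi, P)

print(pi)   # approaches the stationary state (5/6, 1/6)
```

The limit solves $\pi P = \pi$. Here $0.1 \, \pi_0 = 0.5 \, \pi_1$
together with $\pi_0 + \pi_1 = 1$ gives $\pi = (5/6, 1/6)$.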


In the same way as mathematics reaches out into other scientific areas,
probability theory has connections with many other branches of mathematics.
The last chapter of these notes gives some examples. The section on
percolation shows how probability theory can help to understand critical
phenomena. In solid state physics, one considers operator-valued random
variables. The spectra of random operators are random objects too. One is
interested in what happens with probability one. Localization is the
phenomenon in solid state physics that sufficiently random operators often
have pure point spectrum. The section on estimation theory gives a glimpse
of what mathematical statistics is about. In statistics one often does not
know the probability space itself so that one has to make a statistical
model and look at a parameterization of probability spaces. The goal is to
give maximum likelihood estimates for the parameters from data and to
understand how small the quadratic estimation error can be made. A section
on Vlasov dynamics shows how probability theory appears in problems of
geometric evolution. Vlasov dynamics is a generalization of the n-body
problem to the evolution of probability measures. One can look at the
evolution of smooth measures or measures located on surfaces. This
deterministic stochastic system produces an evolution of densities which
can form singularities without doing harm to the formalism. It also defines
the evolution of surfaces. The section on moment problems is part of
multivariate statistics. As for random variables, random vectors can be
described by their moments. Since moments define the law of the random
variable, the question arises how one can see from the moments, whether we
have a continuous random variable. The section on random maps is another
part of dynamical systems theory. Randomized versions of diffeomorphisms
can be considered idealizations of their undisturbed versions. They often
can be understood better than their deterministic versions. For example,
many random diffeomorphisms have only finitely many ergodic components. In
the section on circular random variables, we see that the Mises
distribution has extremal entropy among all circle-valued random variables
with given circular mean and variance. There is also a central limit
theorem on the circle: the sum of IID circular random variables converges
in law to the uniform distribution. We then look at a problem in the
geometry of numbers: how many lattice points are there in a neighborhood of
the graph of one-dimensional Brownian motion? The analysis of this problem
needs a law of large numbers for independent random variables $X_k$ with
uniform distribution on $[0,1]$: for $0 < \delta < 1$, and
$A_n = [0, 1/n^\delta]$ one has
$\lim_{n \to \infty} \frac{1}{n^{1-\delta}} \sum_{k=1}^n 1_{A_n}(X_k) = 1$.
Probability theory also matters in complexity theory as a section on
arithmetic random variables shows. It turns out that random variables like
$X_n(k) = k$, $Y_n(k) = k^2 + 3 \bmod n$ defined on finite probability
spaces become independent in the limit $n \to \infty$. Such considerations
matter in complexity theory: arithmetic functions defined on large but
finite sets behave very much like random functions. This is reflected by
the fact that the inverse of arithmetic functions is in general difficult
to compute and belongs to the complexity class NP. Indeed, if one could
invert arithmetic functions easily, one could solve problems like factoring
integers fast. A short section on Diophantine equations indicates how the
distribution of random variables can shed light on the solution of
Diophantine equations. Finally, we look at a topic in harmonic analysis
which was initiated by Norbert Wiener. It deals with the relation of the
characteristic function $\phi_X$ and the continuity properties of the
random variable $X$.


1.2 Some paradoxes in probability theory 


Colloquial language is not always precise enough to tackle problems in 
probability theory. Paradoxes appear, when definitions allow different in- 
terpretations. Ambiguous language can lead to wrong conclusions or con- 
tradicting solutions. To illustrate this, we mention a few problems. The 
following four examples should serve as a motivation to introduce proba- 
bility theory on a rigorous mathematical footing. 


1) Bertrand’s paradox (Bertrand 1889) 
We throw lines at random onto the unit disc. What is the probability that
the line intersects the disc with a length $\ge \sqrt{3}$, the side length
of the inscribed equilateral triangle?


First answer: take an arbitrary point $P$ on the boundary of the disc. The
set of all lines through that point is parameterized by an angle $\phi$. In
order that the chord is longer than $\sqrt{3}$, the line has to lie within
a sector of 60° within a range of 180°. The probability is 1/3.


Second answer: take all lines perpendicular to a fixed diameter. The chord
is longer than $\sqrt{3}$ if the point of intersection lies on the middle
half of the diameter. The probability is 1/2.


Third answer: if the midpoints of the chords lie in a disc of radius 1/2,
the chord is longer than $\sqrt{3}$. Because the disc has a radius which is
half the radius of the unit disc, the probability is 1/4.


Figure. Random angle.
Figure. Random translation.
Figure. Random area.


Like most paradoxes in mathematics, a part of the question in Bertrand’s
problem is not well defined. Here it is the term "random line". The
solution of the paradox lies in the fact that the three answers depend on
the chosen probability distribution. There are several "natural"
distributions. The actual answer depends on how the experiment is performed.
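
The dependence on the chosen distribution can be made visible by
simulation. The following sketch draws random chords of the unit disc in
the three ways described above and estimates the probability that the chord
is longer than $\sqrt{3}$. The function names, the number of trials and the
seed are our own choices for this illustration.

```python
import math
import random

rng = random.Random(3)        # fixed seed, an arbitrary choice
LIMIT = math.sqrt(3)          # side length of the inscribed triangle

def chord_random_angle():
    """Chord through a fixed boundary point; the angle phi between the
    line and the tangent is uniform in [0, pi]. Its length is 2 sin(phi)."""
    phi = rng.uniform(0.0, math.pi)
    return 2.0 * math.sin(phi)

def chord_random_translation():
    """Chord perpendicular to a fixed diameter, meeting it at a point t
    which is uniform in [-1, 1]. Its length is 2 sqrt(1 - t^2)."""
    t = rng.uniform(-1.0, 1.0)
    return 2.0 * math.sqrt(1.0 - t * t)

def chord_random_area():
    """Chord whose midpoint is uniform in the disc (rejection sampling)."""
    while True:
        x, y = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        r2 = x * x + y * y
        if r2 < 1.0:
            return 2.0 * math.sqrt(1.0 - r2)

def probability_long(chord, trials=100_000):
    """Fraction of sampled chords which are longer than sqrt(3)."""
    return sum(chord() > LIMIT for _ in range(trials)) / trials

p_angle = probability_long(chord_random_angle)
p_translation = probability_long(chord_random_translation)
p_area = probability_long(chord_random_area)
print(p_angle, p_translation, p_area)   # near 1/3, 1/2 and 1/4
```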


2) Petersburg paradox (D.Bernoulli, 1738) 

In the Petersburg casino, you pay an entrance fee $c$ and you get the prize
$2^T$, where $T$ is the number of times the casino flips a coin until
"head" appears. For example, if the sequence of coin experiments would give
"tail, tail, tail, head", you would win $2^3 - c = 8 - c$, the win minus
the entrance fee. Fair would be an entrance fee which is equal to the
expectation of the win, which is

$$ \sum_{k=1}^{\infty} 2^k P[T = k] = \sum_{k=1}^{\infty} 1 = \infty . $$


The paradox is that nobody would agree to pay even an entrance fee c = 10.
The problem with this casino is that it is not quite clear what is "fair".
For example, the situation $T = 20$ is so improbable that it never occurs
in the life-time of a person. Therefore, for any practical reason, one does
not have to worry about large values of $T$. This, as well as the
finiteness of money resources, is the reason why casinos do not have to
worry about the following bullet proof martingale strategy in roulette: bet
$c$ dollars on red. If you win, stop, if you lose, bet $2c$ dollars on red.
If you win, stop. If you lose, bet $4c$ dollars on red. Keep doubling the
bet. Eventually after $n$ steps, red will occur and you will win
$2^n c - (c + 2c + \dots + 2^{n-1} c) = c$ dollars. This example motivates
the concept of martingales. Theorem (3.2.7) or proposition (3.2.9) will
shed some light on this. Back to the Petersburg paradox. How does one
resolve it? What would be a reasonable entrance fee in "real life"?
Bernoulli proposed to replace the expectation $E[G]$ of the profit
$G = 2^T$ with the expectation $(E[\sqrt{G}])^2$, where $u(x) = \sqrt{x}$
is called a utility function. This would lead to a fair entrance fee

$$ (E[\sqrt{G}])^2 = \Big( \sum_{k=1}^{\infty} 2^{k/2} 2^{-k} \Big)^2 = \frac{1}{(\sqrt{2} - 1)^2} \sim 5.828\ldots . $$

It is not so clear if that is a way out of the paradox because for any
proposed utility function $u(k)$, one can modify the casino rule so that
the paradox reappears: pay $(2^k)^2$ if the utility function is
$u(k) = \sqrt{k}$ or pay $e^{2^k}$ dollars, if the utility function is
$u(k) = \log(k)$. Such reasoning plays a role in economics and social
sciences.


Figure. The picture shows the average profit development during a typical
tournament of 4000 Petersburg games. After these 4000 games, the player
would have lost about 10 thousand dollars, when paying a 10 dollar entrance
fee each game. The player would have to play a very, very long time to catch
up. Mathematically, the player will do so and have a profit in the long run,
but it is unlikely that it will happen in his or her lifetime.
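The three computations in the Petersburg discussion can be checked numerically. The following is a minimal sketch, not part of the original notes; the function names `bernoulli_fee`, `doubling_net_win` and `truncated_expectation` are hypothetical, and it assumes $P[T = k] = 2^{-k}$ as in the expectation formula.

```python
from fractions import Fraction

def bernoulli_fee(terms=200):
    """Approximate (E[sqrt(G)])^2 for G = 2^T, assuming P[T = k] = 2^(-k)."""
    s = sum(2 ** (k / 2) * 2 ** (-k) for k in range(1, terms + 1))
    return s * s

def doubling_net_win(c, n):
    """Net win if red first shows after the n losing bets c, 2c, ..., 2^(n-1) c."""
    lost = sum(2 ** j * c for j in range(n))
    return 2 ** n * c - lost

def truncated_expectation(max_flips):
    """Sum of 2^k * P[T = k] for k <= max_flips; every term equals 1."""
    return sum(Fraction(2 ** k, 2 ** k) for k in range(1, max_flips + 1))
```

The truncated expectation grows linearly with the number of allowed flips, which is the divergence behind the paradox, while the square root utility gives a finite value close to 5.828.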


3) The three door problem (1991)

Suppose you're on a game show and you are given a choice of three doors.
Behind one door is a car and behind the others are goats. You pick a door,
say No. 1, and the host, who knows what's behind the doors, opens another
door, say No. 3, which has a goat. (In all games, he opens a door to reveal
a goat.) He then says to you, "Do you want to pick door No. 2?" (In all
games he always offers an option to switch.) Is it to your advantage to
switch your choice?


The problem is also called the "Monty Hall problem". It was discussed by
Marilyn vos Savant in a "Parade" column in 1991 and provoked a big
controversy. (See [98] for pointers and similar examples.) The problem is
that intuitive argumentation can easily lead to the conclusion that it does
not matter whether one changes the door or not. In fact, switching the door
doubles the chances to win:

No switching: you choose a door and win with probability 1/3. The door
opened by the host no longer affects your choice.

Switching: if you first choose the door with the car, you lose since you
switch. If you first choose a door with a goat, the host opens the other door
with a goat and you win by switching. There are two such cases where you
win. The probability to win is 2/3.
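The case distinction above can be verified by exact enumeration. This is a small illustrative sketch, not from the original text; the function name `monty_hall` is hypothetical, and it assumes that the car position and the first pick are uniform and that the host chooses uniformly when two goat doors are available.

```python
from fractions import Fraction
from itertools import product

def monty_hall():
    """Return (P[win when staying], P[win when switching]) by enumeration."""
    doors = (1, 2, 3)
    stay = switch = Fraction(0)
    for car, pick in product(doors, doors):       # 9 equally likely cases
        weight = Fraction(1, 9)
        # The host may open any door that is neither the pick nor the car.
        options = [d for d in doors if d != pick and d != car]
        for opened in options:
            w = weight / len(options)             # assumed uniform host choice
            other = next(d for d in doors if d not in (pick, opened))
            if pick == car:
                stay += w
            if other == car:
                switch += w
    return stay, switch
```

Calling `monty_hall()` returns the pair of exact winning probabilities for the two strategies.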


4) The Banach-Tarski paradox (1924) 

It is possible to cut the standard unit ball $\Omega = \{x \in \mathbb{R}^3 \mid |x| \leq 1\}$ into 5
disjoint pieces $\Omega = Y_1 \cup Y_2 \cup Y_3 \cup Y_4 \cup Y_5$ and rotate and translate the pieces
with transformations $T_i$ so that $T_1(Y_1) \cup T_2(Y_2) = \Omega$ and $T_3(Y_3) \cup T_4(Y_4) \cup
T_5(Y_5) = \Omega'$ is a second unit ball $\Omega' = \{x \in \mathbb{R}^3 \mid |x - (3,0,0)| \leq 1\}$ and all
the transformed sets again don't intersect.

While this example of Banach-Tarski is spectacular, subsets $A$ of the circle
to which one can not assign a translation invariant probability $P[A]$ exist
already in one dimension. The Italian mathematician Giuseppe Vitali gave in
1905 the following example: define an equivalence relation on the circle
$\mathbb{T} = [0, 2\pi)$ by saying that two angles are equivalent, $x \sim y$, if $(x-y)/\pi$ is
rational. Let $A$ be a subset of the circle which contains exactly one number
from each equivalence class. The axiom of choice assures the existence of $A$.
If $x_1, x_2, \ldots$ is an enumeration of the set of rational angles in the circle,
then the sets $A_i = A + x_i$ are pairwise disjoint and satisfy
$\bigcup_{i=1}^{\infty} A_i = \mathbb{T}$. If we could assign a translation invariant probability
$P[A_i]$ to the sets $A_i$, then the basic rules of probability would give

$$ 1 = P[\mathbb{T}] = P[\bigcup_{i=1}^{\infty} A_i] = \sum_{i=1}^{\infty} P[A_i] = \sum_{i=1}^{\infty} p . $$

But there is no real number $p = P[A] = P[A_i]$ which makes this possible.
Both the Banach-Tarski and Vitali's result show that one can not
hope to define a probability space on the algebra $\mathcal{A}$ of all subsets of the unit
ball or the unit circle such that the probability measure is translation
and rotation invariant. The natural concepts of "length" or "volume",
which are rotation and translation invariant, only make sense for a
smaller algebra. This will lead to the notion of $\sigma$-algebra. In the context
of topological spaces like Euclidean spaces, it leads to Borel $\sigma$-algebras,
algebras of sets generated by the compact sets of the topological space.
This language will be developed in the next chapter.

Probability theory is a central topic in mathematics. There are close
relations and intersections with other fields like computer science, ergodic
theory and dynamical systems, cryptology, game theory, analysis, partial
differential equations, mathematical physics, economics, statistical
mechanics and even number theory. As a motivation, we give some problems
and topics which can be treated with probabilistic methods.


1) Random walks (statistical mechanics, gambling, stock markets, quantum
field theory).

Assume you walk through a lattice. At each vertex, you choose a direction
at random. What is the probability that you return to your starting
point? Polya's theorem (3.8.1) says that in two dimensions, a random
walker almost certainly returns to the origin arbitrarily often, while in three
dimensions, the walker with probability 1 returns only a finite number of
times and then escapes forever.
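A short simulation can make the difference between low and high dimensions visible. The sketch below is an illustration only and proves nothing: it looks at a finite time horizon, and the names `returns_to_origin` and `return_frequency` as well as the trial counts and seed are hypothetical choices.

```python
import random

def returns_to_origin(dim, max_steps, rng):
    """True if a simple random walk on Z^dim revisits 0 within max_steps."""
    pos = [0] * dim
    for _ in range(max_steps):
        axis = rng.randrange(dim)          # pick a coordinate direction
        pos[axis] += rng.choice((-1, 1))   # step forward or backward
        if not any(pos):
            return True
    return False

def return_frequency(dim, trials=400, max_steps=400, seed=0):
    """Fraction of simulated walks that return to the origin."""
    rng = random.Random(seed)
    hits = sum(returns_to_origin(dim, max_steps, rng) for _ in range(trials))
    return hits / trials
```

In one dimension nearly all simulated walks come back within a few hundred steps, while in three dimensions a clear majority does not, in line with the statement of the theorem.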


Figure. A random walk in one dimension displayed as a graph $(t, B_t)$.
Figure. A piece of a random walk in two dimensions.
Figure. A piece of a random walk in three dimensions.


2) Percolation problems (model of a porous medium, statistical mechanics, 
critical phenomena). 

Each bond of a rectangular lattice in the plane is connected with probability
$p$ and disconnected with probability $1 - p$. Two lattice points $x, y$ in the
lattice are in the same cluster if there is a path from $x$ to $y$. One says that
"percolation occurs" if there is a positive probability that an infinite cluster
appears. One problem is to find the critical probability $p_c$, the infimum of all
$p$ for which percolation occurs. The problem can be extended to situations
where the switch probabilities are not independent of each other. Some
random variables like the size of the largest cluster are of interest near the
critical probability $p_c$.

Figure. Bond percolation with p=0.4.
Figure. Bond percolation with p=0.6.


A variant of bond percolation is site percolation where the nodes of the 
lattice are switched on with probability p. 


Figure. Site percolation with p=0.2.
Figure. Site percolation with p=0.4.
Figure. Site percolation with p=0.6.


Generalized percolation problems are obtained when the independence
of the individual nodes is relaxed. A class of such dependent percolation
problems can be obtained by choosing two irrational numbers $\alpha, \beta$
like $\alpha = \sqrt{2} - 1$ and $\beta = \sqrt{3} - 1$ and switching the node $(n,m)$ on if
$(n\alpha + m\beta) \bmod 1 \in [0,p)$. The probability of switching a node on is again
$p$, but the random variables

$$ X_{n,m} = 1_{(n\alpha + m\beta) \bmod 1 \in [0,p)} $$

are no longer independent.

Figure. Dependent site percolation with p=0.2.
Figure. Dependent site percolation with p=0.4.
Figure. Dependent site percolation with p=0.6.


Even more general percolation problems are obtained if the distribution
of the random variables $X_{n,m}$ can also depend on the position $(n,m)$.
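The dependent site percolation rule is easy to generate on a computer. The following sketch is only an illustration under the stated choice $\alpha = \sqrt{2} - 1$, $\beta = \sqrt{3} - 1$; the names `site_on` and `density` and the grid size are hypothetical.

```python
import math

ALPHA = math.sqrt(2) - 1
BETA = math.sqrt(3) - 1

def site_on(n, m, p, alpha=ALPHA, beta=BETA):
    """X_{n,m}: the node (n, m) is on if (n*alpha + m*beta) mod 1 is in [0, p)."""
    return (n * alpha + m * beta) % 1.0 < p

def density(p, size=100):
    """Fraction of switched-on nodes in the square {0..size-1}^2."""
    on = sum(site_on(n, m, p) for n in range(size) for m in range(size))
    return on / size ** 2
```

The observed density of switched-on nodes is close to $p$, even though the configuration is completely determined by $\alpha$, $\beta$ and $p$.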


3) Random Schrödinger operators (quantum mechanics, functional analysis,
disordered systems, solid state physics).

Consider the linear map $Lu(n) = \sum_{|m-n|=1} u(m) + V(n)u(n)$ on the space
of sequences $u = (\ldots, u_{-2}, u_{-1}, u_0, u_1, u_2, \ldots)$. We assume that $V(n)$ takes
random values in $\{0,1\}$. The function $V$ is called the potential. The problem
is to determine the spectrum or spectral type of the infinite matrix $L$ on
the Hilbert space $l^2$ of all sequences $u$ with finite $\|u\|_2^2 = \sum_{n=-\infty}^{\infty} u_n^2$.
The operator $L$ is the Hamiltonian of an electron in a one-dimensional
disordered crystal. The spectral properties of $L$ are related to the
conductivity properties of the crystal. Of special interest is the situation
where the values $V(n)$ are all independent random variables. It turns out
that if $V(n)$ are IID random variables with a continuous distribution, there
are many eigenvalues for the infinite dimensional matrix $L$, at least with
probability 1. This phenomenon is called localization.

Figure. A wave $\psi(t) = e^{iLt}\psi(0)$ evolving in a random potential at t = 0.
Shown are both the potential $V_n$ and the wave $\psi(0)$.
Figure. A wave $\psi(t) = e^{iLt}\psi(0)$ evolving in a random potential at t = 1.
Shown are both the potential $V_n$ and the wave $\psi(1)$.
Figure. A wave $\psi(t) = e^{iLt}\psi(0)$ evolving in a random potential at t = 2.
Shown are both the potential $V_n$ and the wave $\psi(2)$.


More general operators are obtained by allowing $V(n)$ to be random
variables with the same distribution but where one does not insist on
independence any more. A well studied example is the almost Mathieu operator,
where $V(n) = \lambda \cos(\theta + n\alpha)$ and for which $\alpha/(2\pi)$ is irrational.

4) Classical dynamical systems (celestial mechanics, fluid dynamics,
mechanics, population models)


The study of deterministic dynamical systems like the logistic map $x \mapsto
4x(1-x)$ on the interval $[0,1]$ or the three body problem in celestial
mechanics has shown that such systems or subsets of them can behave like random
systems. Many effects can be described by ergodic theory, which can be
seen as a brother of probability theory. Many results in probability
theory generalize to the more general setup of ergodic theory. An example is
Birkhoff's ergodic theorem which generalizes the law of large numbers.

Figure. Iterating the logistic map $T(x) = 4x(1-x)$ on $[0,1]$ produces
independent random variables. The invariant measure $P$ is continuous.
Figure. The simple mechanical system of a double pendulum exhibits
complicated dynamics. The differential equation defines a measure
preserving flow $T_t$ on a probability space.
Figure. A short time evolution of the Newtonian three body problem. There
are energies and subsets of the energy surface which are invariant and on
which there is an invariant probability measure.


Given a dynamical system defined by a map $T$ or a flow $T_t$ on a subset $\Omega$ of
some Euclidean space, one obtains for every invariant probability measure
$P$ a probability space $(\Omega, \mathcal{A}, P)$. An observed quantity like a coordinate of
an individual particle is a random variable $X$ and defines a stochastic
process $X_n(\omega) = X(T^n \omega)$. For many dynamical systems, including also some 3
body problems, there are invariant measures and observables $X$ for which
$X_n$ are IID random variables. Probability theory is therefore intrinsically
relevant also in classical dynamical systems.
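The idea that an orbit of a deterministic map behaves like a random sample can be tried out on the logistic map. The sketch below is an illustration, not taken from the notes. It uses the known fact that the absolutely continuous invariant measure of $T(x) = 4x(1-x)$ has density $1/(\pi\sqrt{x(1-x)})$ with mean 1/2, and it assumes that floating point orbits follow this statistics over moderately many steps; the function name, the starting point 0.3 and the step count are hypothetical choices.

```python
def logistic_time_average(x0=0.3, steps=20000):
    """Time average (1/n) * sum of X(T^k x0) for the observable X(x) = x."""
    x, total = x0, 0.0
    for _ in range(steps):
        total += x
        x = 4.0 * x * (1.0 - x)   # one application of the logistic map T
    return total / steps
```

By Birkhoff's ergodic theorem the time average should approach the space average 1/2 for almost every starting point, which is what the computation suggests.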


5) Cryptology. (computer science, coding theory, data encryption) 


Coding theory deals with the mathematics of encrypting codes or deals 
with the design of error correcting codes. Both aspects of coding theory 
have important applications. A good code can repair loss of information 
due to bad channels and hide the information in an encrypted way. While 
many aspects of coding theory are based in discrete mathematics, number 
theory, algebra and algebraic geometry, there are probabilistic and combi- 
natorial aspects to the problem. We illustrate this with the example of a 
public key encryption algorithm whose security is based on the fact that 
it is hard to factor a large integer N = pq into its prime factors p,q but 
easy to verify that p,q are factors, if one knows them. The number N can 
be public but only the person, who knows the factors p,q can read the 
message. Assume, we want to crack the code and find the factors p and q. 


The simplest method is to try to find the factors by trial and error but this is
impractical already if $N$ has 50 digits. We would have to search through $10^{25}$
numbers to find the factor $p$. This corresponds to probing 100 million times
every second over a time span of 15 billion years. There are better methods
known and we want to illustrate one of them now: assume we want to find
the factors of N = 11111111111111111111111111111111111111111111111. 
The method goes as follows: start with an integer $a$ and iterate the quadratic
map $T(x) = x^2 + c \bmod N$ on $\{0, 1, \ldots, N-1\}$. If we assume the numbers
$x_0 = a, x_1 = T(a), x_2 = T(T(a)), \ldots$ to be random, how many such numbers
do we have to generate until two of them are the same modulo one of the
prime factors $p$? The answer is surprisingly small and based on the birthday
paradox: the probability that in a group of 23 students, two of them have the
same birthday is larger than 1/2: the probability of the event that we have
no birthday match is $1 \cdot (364/365)(363/365) \cdots (343/365) = 0.492703\ldots$, so
that the probability of a birthday match is $1 - 0.492703 = 0.507297$. This
is larger than 1/2. If we apply this thinking to the sequence of numbers
$x_i$ generated by the pseudo random number generator $T$, then we expect
to have a chance of 1/2 for finding a match modulo $p$ in $\sqrt{p}$ iterations.
Because $p \leq \sqrt{N}$, we have to try $N^{1/4}$ numbers to get a factor: if $x_n$ and
$x_m$ are the same modulo $p$, then $\gcd(x_n - x_m, N)$ produces the factor $p$ of
$N$. In the above example of the 46 digit number $N$, there is a prime factor
$p = 35121409$. The Pollard algorithm finds this factor with probability 1/2
in $\sqrt{p} \approx 5926$ steps. This is an estimate only which gives the order of
magnitude. With the above $N$, if we start with $a = 17$ and take $c = 3$, then we
have a match $x_{27720} = x_{13860}$. It can be found very fast.
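Both the birthday computation and the factoring method fit in a few lines. The following is a minimal sketch, not the author's implementation: it assumes Floyd's cycle detection (comparing $x_k$ with $x_{2k}$) as the way to find a match, and the names `no_match_probability`, `pollard_rho` and the default parameters are hypothetical.

```python
from math import gcd

def no_match_probability(people, days=365):
    """Probability that no two of the given people share a birthday."""
    prob = 1.0
    for i in range(people):
        prob *= (days - i) / days
    return prob

def pollard_rho(n, a=2, c=1, max_steps=100000):
    """Return a nontrivial factor of n, or None if this (a, c) fails."""
    x = y = a
    for _ in range(max_steps):
        x = (x * x + c) % n        # x moves one step: x_k
        y = (y * y + c) % n        # y moves two steps: x_{2k}
        y = (y * y + c) % n
        d = gcd(abs(x - y), n)
        if d == n:                 # match modulo n itself, no information
            return None
        if d > 1:                  # match modulo a proper factor of n
            return d
    return None
```

For the small example $n = 8051 = 83 \cdot 97$ with $a = 2$, $c = 1$, the third step already gives $\gcd(|677 - 871|, 8051) = 97$.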


This probabilistic argument would give a rigorous probabilistic estimate 
if we would pick truly random numbers. The algorithm of course gener- 
ates such numbers in a deterministic way and they are not truly random. 
The generator is called a pseudo random number generator. It produces 
numbers which are random in the sense that many statistical tests can 
not distinguish them from true random numbers. Actually, many random 
number generators built into computer operating systems and program- 
ming languages are pseudo random number generators. 


Probabilistic thinking is often involved in designing, investigating and at- 
tacking data encryption codes or random number generators. 


6) Numerical methods. (integration, Monte Carlo experiments, algorithms) 
In applied situations, it is often very difficult to find integrals directly. This 
happens for example in statistical mechanics or quantum electrodynamics, 
where one wants to find integrals in spaces with a large number of dimen- 
sions. One can nevertheless compute numerical values using Monte Carlo 
Methods with a manageable amount of effort. Limit theorems assure that 
these numerical values are reasonable. Let us illustrate this with a very 
simple but famous example, the Buffon needle problem. 


A stick of length 2 is thrown onto the plane filled with parallel lines, all
of which are distance $d = 2$ apart. If the center of the stick falls at
distance $y$ from the nearest line, then the interval of angles leading to an
intersection with a grid line has length $2\arccos(y)$ among a possible range
of angles $[0, \pi]$. The probability of hitting a line is therefore
$\int_0^1 2\arccos(y)/\pi \, dy = 2/\pi$. This leads to a Monte Carlo method to
compute $\pi$. Just throw $n$ sticks randomly onto the plane and count the
number $k$ of times a stick hits a line. The number $2n/k$ is an approximation
of $\pi$. This is of course not an effective way to compute $\pi$ but it
illustrates the principle.


Figure. The Buffon needle problem is a Monte Carlo method to compute $\pi$.
By counting the number of hits in a sequence of experiments, one can get
random approximations of $\pi$. The law of large numbers assures that
the approximations will converge to the expected limit. All Monte
Carlo computations are theoretically based on limit theorems.
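The Buffon experiment can be imitated with a pseudo random number generator. The sketch below is an illustration only; the function name `buffon_estimate`, the seed and the sample size are hypothetical. Note that the simulation itself uses the value of $\pi$ to draw a uniform angle, so it demonstrates the principle rather than an independent way to compute $\pi$.

```python
import math
import random

def buffon_estimate(n, seed=0):
    """Estimate pi as 2n/k from n simulated sticks of length 2, lines 2 apart."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        y = rng.random()                    # distance of the center to the nearest line
        theta = rng.uniform(0.0, math.pi)   # angle between stick and the normal direction
        if abs(math.cos(theta)) >= y:       # half-length 1 projected on the normal reaches the line
            hits += 1
    return 2 * n / hits
```

With $n = 100000$ throws the estimate typically agrees with $\pi$ to about two digits, which shows how slowly such Monte Carlo approximations converge.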


Chapter 2 


Limit theorems 


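2.1 Probability spaces, random variables, independence

In this section we define the basic notions of a "probability space" and
"random variables" on an arbitrary set $\Omega$.

Definition. A set $\mathcal{A}$ of subsets of $\Omega$ is called a $\sigma$-algebra if the following
three properties are satisfied:

(i) $\Omega \in \mathcal{A}$,
(ii) $A \in \mathcal{A} \Rightarrow A^c = \Omega \setminus A \in \mathcal{A}$,
(iii) $A_n \in \mathcal{A} \Rightarrow \bigcup_{n \in \mathbb{N}} A_n \in \mathcal{A}$.

A pair $(\Omega, \mathcal{A})$ for which $\mathcal{A}$ is a $\sigma$-algebra in $\Omega$ is called a measurable space.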


Properties. If $\mathcal{A}$ is a $\sigma$-algebra, and $A_n$ is a sequence in $\mathcal{A}$, then the
following properties follow immediately by checking the axioms:

1) $\bigcap_{n \in \mathbb{N}} A_n \in \mathcal{A}$.
2) $\limsup_n A_n := \bigcap_{n=1}^{\infty} \bigcup_{m=n}^{\infty} A_m \in \mathcal{A}$.
3) $\liminf_n A_n := \bigcup_{n=1}^{\infty} \bigcap_{m=n}^{\infty} A_m \in \mathcal{A}$.
4) If $\mathcal{A}, \mathcal{B}$ are algebras, then $\mathcal{A} \cap \mathcal{B}$ is an algebra.
5) If $\{\mathcal{A}_i\}_{i \in I}$ is a family of $\sigma$-sub-algebras of $\mathcal{A}$, then
$\bigcap_{i \in I} \mathcal{A}_i$ is a $\sigma$-algebra.

Example. For an arbitrary set $\Omega$, $\mathcal{A} = \{\emptyset, \Omega\}$ is a $\sigma$-algebra. It is called
the trivial $\sigma$-algebra.

Example. If $\Omega$ is an arbitrary set, then $\mathcal{A} = \{A \subset \Omega\}$ is a $\sigma$-algebra. The
set of all subsets of $\Omega$ is the largest $\sigma$-algebra one can define on a set.

Example. A finite set of subsets $A_1, A_2, \ldots, A_n$ of $\Omega$ which are pairwise
disjoint and whose union is $\Omega$ is called a partition of $\Omega$. It generates the
$\sigma$-algebra $\mathcal{A} = \{A = \bigcup_{j \in J} A_j\}$, where $J$ runs over all subsets of $\{1, \ldots, n\}$.
This $\sigma$-algebra has $2^n$ elements. Every finite $\sigma$-algebra is of this form. The
smallest nonempty elements $\{A_1, \ldots, A_n\}$ of this algebra are called atoms.


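Definition. For any set $\mathcal{C}$ of subsets of $\Omega$, we can define $\sigma(\mathcal{C})$, the smallest
$\sigma$-algebra $\mathcal{A}$ which contains $\mathcal{C}$. The $\sigma$-algebra $\mathcal{A}$ is the intersection of all
$\sigma$-algebras which contain $\mathcal{C}$. It is again a $\sigma$-algebra.

Example. For $\Omega = \{1,2,3\}$, the set $\mathcal{C} = \{\{1,2\}, \{2,3\}\}$ generates the
$\sigma$-algebra $\mathcal{A}$ which consists of all 8 subsets of $\Omega$.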
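On a finite set, the generated $\sigma$-algebra can be computed by closing the generators under complements and unions until nothing new appears. The following sketch illustrates this; it is not part of the original notes, and the function name `generated_sigma_algebra` is hypothetical. It only covers finite $\Omega$, where countable unions reduce to finite ones.

```python
from itertools import combinations

def generated_sigma_algebra(omega, generators):
    """Smallest family containing the generators, closed under complement and union."""
    omega = frozenset(omega)
    sets = {frozenset(), omega} | {frozenset(g) for g in generators}
    changed = True
    while changed:
        changed = False
        current = list(sets)
        for a in current:                      # close under complements
            if omega - a not in sets:
                sets.add(omega - a)
                changed = True
        for a, b in combinations(current, 2):  # close under pairwise unions
            if a | b not in sets:
                sets.add(a | b)
                changed = True
    return sets
```

For example, `generated_sigma_algebra({1, 2, 3}, [{1, 2}, {2, 3}])` produces all 8 subsets, and a partition into three atoms also produces $2^3 = 8$ sets.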


Definition. If $(E, \mathcal{O})$ is a topological space, where $\mathcal{O}$ is the set of open sets
in $E$, then $\sigma(\mathcal{O})$ is called the Borel $\sigma$-algebra of the topological space. If
$\mathcal{A} \subset \mathcal{B}$, then $\mathcal{A}$ is called a subalgebra of $\mathcal{B}$. A set $B$ in $\mathcal{B}$ is also called a
Borel set.

Remark. One sometimes defines the Borel $\sigma$-algebra as the $\sigma$-algebra
generated by the set of compact sets $\mathcal{C}$ of a topological space. Compact sets
in a topological space are sets for which every open cover has a finite
subcover. In Euclidean spaces $\mathbb{R}^n$, where compact sets coincide with the sets
which are both bounded and closed, the Borel $\sigma$-algebra generated by the
compact sets is the same as the one generated by open sets. The two
definitions agree for a large class of topological spaces like "locally compact
separable metric spaces".

Remark. Often, the Borel $\sigma$-algebra is enlarged to the $\sigma$-algebra of all
Lebesgue measurable sets, which includes all sets $B$ which are a subset
of a Borel set $A$ of measure 0. The smallest $\sigma$-algebra $\overline{\mathcal{B}}$ which contains
all these sets is called the completion of $\mathcal{B}$. The completion of the Borel
$\sigma$-algebra is the $\sigma$-algebra of all Lebesgue measurable sets. It is in general
strictly larger than the Borel $\sigma$-algebra. But it can also have pathological
features: for example, the composition of a Lebesgue measurable function with
a continuous function need not be Lebesgue measurable any more.
(See [109], Example 2.4).

Example. The $\sigma$-algebra generated by the open balls $\mathcal{C} = \{A = B_\epsilon(x)\}$ of
a metric space $(X,d)$ need not agree with the family of Borel subsets,
which are generated by $\mathcal{O}$, the set of open sets in $(X,d)$.

Proof. Take the metric space $(\mathbb{R}, d)$ where $d(x,y) = 1_{\{x \neq y\}}$ is the discrete
metric. Because any subset of $\mathbb{R}$ is open, the Borel $\sigma$-algebra is the set of
all subsets of $\mathbb{R}$. The open balls in $\mathbb{R}$ are either single points or the whole
space. The $\sigma$-algebra generated by the open balls is the set of countable
subsets of $\mathbb{R}$ together with their complements.

Example. If $\Omega = [0,1] \times [0,1]$ is the unit square and $\mathcal{C}$ is the set of all sets
of the form $[0,1] \times [a,b]$ with $0 < a < b < 1$, then $\sigma(\mathcal{C})$ is the $\sigma$-algebra of
all sets of the form $[0,1] \times A$, where $A$ is in the Borel $\sigma$-algebra of $[0,1]$.


Definition. Given a measurable space $(\Omega, \mathcal{A})$. A function $P: \mathcal{A} \to \mathbb{R}$ is
called a probability measure and $(\Omega, \mathcal{A}, P)$ is called a probability space if
the following three properties called Kolmogorov axioms are satisfied:

(i) $P[A] \geq 0$ for all $A \in \mathcal{A}$,
(ii) $P[\Omega] = 1$,
(iii) $A_n \in \mathcal{A}$ disjoint $\Rightarrow P[\bigcup_n A_n] = \sum_n P[A_n]$.

The last property is called $\sigma$-additivity.

Properties. Here are some basic properties of the probability measure
which immediately follow from the definition:

1) $P[\emptyset] = 0$.
2) $A \subset B \Rightarrow P[A] \leq P[B]$.


Remark. There are different ways to build the axioms for a probability
space. One could for example replace (i) and (ii) with properties 4),5) in
the above list. Statement 6) is equivalent to $\sigma$-additivity if $P$ is only assumed
to be additive.

Remark. The name "Kolmogorov axioms" honors a monograph of Kolmogorov
from 1933 [53] in which an axiomatization appeared. Other mathematicians
have formulated similar axiomatizations at the same time, like
Hans Reichenbach in 1932. According to Doob, axioms (i)-(iii) were first
proposed by G. Bohlmann in 1908 [22].


Definition. A map $X$ from a measurable space $(\Omega, \mathcal{A})$ to another measurable
space $(\Delta, \mathcal{B})$ is called measurable, if $X^{-1}(B) \in \mathcal{A}$ for all $B \in \mathcal{B}$. The set
$X^{-1}(B)$ consists of all points $x \in \Omega$ for which $X(x) \in B$. This pull back set
$X^{-1}(B)$ is defined even if $X$ is non-invertible. For example, for $X(x) = x^2$
on $(\mathbb{R}, \mathcal{B})$ one has $X^{-1}([1,4]) = [1,2] \cup [-2,-1]$.

Definition. A function $X: \Omega \to \mathbb{R}$ is called a random variable, if it is a
measurable map from $(\Omega, \mathcal{A})$ to $(\mathbb{R}, \mathcal{B})$, where $\mathcal{B}$ is the Borel $\sigma$-algebra of
$\mathbb{R}$. Denote by $\mathcal{L}$ the set of all real random variables. The set $\mathcal{L}$ is an
algebra under addition and multiplication: one can add and multiply random
variables and gets new random variables. More generally, one can consider
random variables taking values in a second measurable space $(E, \mathcal{B})$. If
$E = \mathbb{R}^d$, then the random variable $X$ is called a random vector. For a
random vector $X = (X_1, \ldots, X_d)$, each component $X_i$ is a random variable.


Example. Let $\Omega = \mathbb{R}^2$ with Borel $\sigma$-algebra $\mathcal{A}$ and let

$$ P[A] = \frac{1}{2\pi} \int\int_A e^{-(x^2+y^2)/2} \, dx \, dy . $$

Any continuous function $X$ of two variables is a random variable on $\Omega$. For
example, $X(x,y) = xy(x+y)$ is a random variable. But also $X(x,y) =
1/(x+y)$ is a random variable, even though it is not continuous. The
vector-valued function $X(x,y) = (x, y, x^3)$ is an example of a random vector.


Definition. Every random variable $X$ defines a $\sigma$-algebra

$$ X^{-1}(\mathcal{B}) = \{X^{-1}(B) \mid B \in \mathcal{B}\} . $$

We denote this algebra by $\sigma(X)$ and call it the $\sigma$-algebra generated by $X$.

Example. A constant map $X(x) = c$ defines the trivial algebra $\mathcal{A} = \{\emptyset, \Omega\}$.

Example. The map $X(x,y) = x$ from the square $\Omega = [0,1] \times [0,1]$ to the
real line $\mathbb{R}$ defines the algebra $\mathcal{B} = \{A \times [0,1]\}$, where $A$ is in the Borel
$\sigma$-algebra of the interval $[0,1]$.

Example. The map $X$ from $Z_6 = \{0,1,2,3,4,5\}$ to $\{0,1\} \subset \mathbb{R}$ defined by
$X(x) = x \bmod 2$ has the value $X(x) = 0$ if $x$ is even and $X(x) = 1$ if $x$ is
odd. The $\sigma$-algebra generated by $X$ is $\mathcal{A} = \{\emptyset, \{1,3,5\}, \{0,2,4\}, \Omega\}$.
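For a random variable with finitely many values, $\sigma(X)$ can be listed explicitly as the set of preimages of all subsets of the range. The sketch below illustrates this for the last example; it is not from the original text, and the function name `sigma_algebra_of` is hypothetical.

```python
from itertools import combinations

def sigma_algebra_of(X, omega):
    """All preimages X^{-1}(B), where B runs over subsets of the finite range of X."""
    omega = list(omega)
    values = sorted({X(w) for w in omega})
    result = set()
    for r in range(len(values) + 1):
        for B in combinations(values, r):
            result.add(frozenset(w for w in omega if X(w) in B))
    return result
```

For `X = lambda x: x % 2` on `range(6)` this gives the four sets of the example, and for a constant map it gives the trivial algebra with two elements.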


Definition. Given a set $B \in \mathcal{A}$ with $P[B] > 0$, we define

$$ P[A|B] = \frac{P[A \cap B]}{P[B]}, $$

the conditional probability of $A$ with respect to $B$. It is the probability of
the event $A$, under the condition that the event $B$ happens.

Example. We throw two fair dice. Let $A$ be the event that the first dice is
6 and let $B$ be the event that the sum of the two dice is 11. Because $P[B] =
2/36 = 1/18$ and $P[A \cap B] = 1/36$ (we need to throw a 6 and then a 5),
we have $P[A|B] = (1/36)/(1/18) = 1/2$. The interpretation is that since
we know that the event $B$ happens, we have only two possibilities: (5,6)
or (6,5). On this space of possibilities, only the second is compatible with
the event $A$.

Exercise. a) Verify that the Sicherman dice with faces (1,3,4,5,6,8) and
(1,2,2,3,3,4) have the property that the probability of getting the value
$k$ is the same as with a pair of standard dice. For example, the probability
to get 5 with the Sicherman dice is 4/36 because the four cases
(1,4), (3,2), (3,2), (4,1) lead to a sum 5. Also for the standard dice, we have
four cases (1,4), (2,3), (3,2), (4,1).

b) Three dice $A, B, C$ are called non-transitive, if the probability that $A >
B$ is larger than 1/2, the probability that $B > C$ is larger than 1/2 and the
probability that $C > A$ is larger than 1/2. Verify the nontransitivity
property for $A = (1,4,4,4,4,4)$, $B = (3,3,3,3,3,6)$ and $C = (2,2,2,5,5,5)$.
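Statements like those in the exercise can be checked by counting the 36 equally likely pairs of faces. The following sketch is one possible way to do this; the helper names `sum_distribution` and `prob_greater` are hypothetical and not part of the notes.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def sum_distribution(die1, die2):
    """Number of face pairs leading to each possible sum."""
    return Counter(a + b for a, b in product(die1, die2))

def prob_greater(die1, die2):
    """Probability that die1 shows a larger value than die2."""
    wins = sum(a > b for a, b in product(die1, die2))
    return Fraction(wins, len(die1) * len(die2))
```

Comparing `sum_distribution((1, 3, 4, 5, 6, 8), (1, 2, 2, 3, 3, 4))` with the distribution for two standard dice settles part a), and three calls of `prob_greater` settle part b).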


Properties. The following properties of conditional probability are called
Keynes postulates. While they follow immediately from the definition
of conditional probability, they are historically interesting because they
appeared already in 1921 as part of an axiomatization of probability theory:

1) $P[A|B] \geq 0$.
2) $P[A|A] = 1$.
3) $P[A|B] + P[A^c|B] = 1$.
4) $P[A \cap B|C] = P[A|C] \cdot P[B|A \cap C]$.

Definition. A finite set $\{A_1, \ldots, A_n\} \subset \mathcal{A}$ is called a finite partition of $\Omega$ if
$\bigcup_{j=1}^n A_j = \Omega$ and $A_j \cap A_i = \emptyset$ for $i \neq j$. A finite partition covers the entire
space with finitely many, pairwise disjoint sets.

If all possible experiments are partitioned into different events $A_j$ and the
probabilities that $B$ occurs under the condition $A_j$ are known, then one can
compute the probability that $A_i$ occurs knowing that $B$ happens:


Theorem 2.1.1 (Bayes rule). Given a finite partition $\{A_1, \ldots, A_n\}$ in $\mathcal{A}$ and
$B \in \mathcal{A}$ with $P[B] > 0$, one has

$$ P[A_i|B] = \frac{P[B|A_i] P[A_i]}{\sum_{j=1}^n P[B|A_j] P[A_j]} . $$

Proof. Because the denominator is $P[B] = \sum_{j=1}^n P[B|A_j]P[A_j]$, the Bayes
rule just says $P[A_i|B]P[B] = P[B|A_i]P[A_i]$. But these are by definition
both $P[A_i \cap B]$. □

Example. A fair dice is rolled first. It gives a random number $k$ from
$\{1,2,3,4,5,6\}$. Next, a fair coin is tossed $k$ times. Assume we know that
all coins show heads. What is the probability that the score of the dice was
equal to 5?

Solution. Let $B$ be the event that all coins are heads and let $A_j$ be the
event that the dice showed the number $j$. The problem is to find $P[A_5|B]$.
We know $P[B|A_j] = 2^{-j}$. Because the events $A_j, j = 1, \ldots, 6$ form a
partition of $\Omega$, we have $P[B] = \sum_{j=1}^6 P[B \cap A_j] = \sum_{j=1}^6 P[B|A_j]P[A_j] =
\sum_{j=1}^6 2^{-j}/6 = (1/2 + 1/4 + 1/8 + 1/16 + 1/32 + 1/64)/6 = 21/128$. By Bayes
rule,

$$ P[A_5|B] = \frac{P[B|A_5]P[A_5]}{\sum_{j=1}^6 P[B|A_j]P[A_j]} = \frac{(1/32)(1/6)}{21/128} = \frac{2}{63}, $$

which is about 3 percent.
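The Bayes computation for the dice and coin example can be repeated with exact fractions. The sketch below is an illustration under the assumptions of the example (uniform prior on the six faces and likelihood $P[B|A_j] = 2^{-j}$); the function name `posterior` and the variable names are hypothetical.

```python
from fractions import Fraction

def posterior(prior, likelihood):
    """Bayes rule on a finite partition: P[A_j | B] for every j."""
    joint = {k: prior[k] * likelihood[k] for k in prior}   # P[B | A_j] P[A_j]
    total = sum(joint.values())                            # P[B]
    return {k: v / total for k, v in joint.items()}

prior = {j: Fraction(1, 6) for j in range(1, 7)}            # fair dice
likelihood = {j: Fraction(1, 2 ** j) for j in range(1, 7)}  # j heads in j tosses
post = posterior(prior, likelihood)
```

The entry `post[5]` is the probability that the dice showed 5, given that all coins are heads.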


Example. The Girl-Boy problem: "Dave has two children. One child is a boy.
What is the probability that the other child is a girl?"

Most people would intuitively say 1/2 because the second event looks
independent of the first. However, it is not and the initial intuition is
misleading. Here is the solution: first introduce the probability space of all
possible events $\Omega = \{BG, GB, BB, GG\}$ with $P[\{BG\}] = P[\{GB\}] = P[\{BB\}] =
P[\{GG\}] = 1/4$. Let $B = \{BG, GB, BB\}$ be the event that there is at least
one boy and $A = \{GB, BG, GG\}$ be the event that there is at least one
girl. We have

$$ P[A|B] = \frac{P[A \cap B]}{P[B]} = \frac{1/2}{3/4} = \frac{2}{3} . $$

Definition. Two events $A, B$ in a probability space $(\Omega, \mathcal{A}, P)$ are called
independent, if

$$ P[A \cap B] = P[A] \cdot P[B] . $$


Example. The probability space $\Omega = \{1,2,3,4,5,6\}$ and $p_i = P[\{i\}] = 1/6$
describes a fair dice which is thrown once. The set $A = \{1,3,5\}$ is the
event that "the dice produces an odd number". It has the probability 1/2.
The event $B = \{1,2\}$ is the event that the dice shows a number smaller
than 3. It has probability 1/3. The two events are independent because
$P[A \cap B] = P[\{1\}] = 1/6 = P[A] \cdot P[B]$.

Definition. Write $J \subset_f I$ if $J$ is a finite subset of $I$. A family $\{\mathcal{A}_i\}_{i \in I}$ of
$\sigma$-sub-algebras of $\mathcal{A}$ is called independent, if for every $J \subset_f I$ and every choice
$A_j \in \mathcal{A}_j$, $P[\bigcap_{j \in J} A_j] = \prod_{j \in J} P[A_j]$. A family $\{X_j\}_{j \in J}$ of random variables
is called independent, if $\{\sigma(X_j)\}_{j \in J}$ are independent $\sigma$-algebras. A family
of sets $\{A_j\}_{j \in I}$ is called independent, if the $\sigma$-algebras $\mathcal{A}_j = \{\emptyset, A_j, A_j^c, \Omega\}$
are independent.

Example. On $\Omega = \{1,2,3,4\}$ the two $\sigma$-algebras $\mathcal{A} = \{\emptyset, \{1,3\}, \{2,4\}, \Omega\}$
and $\mathcal{B} = \{\emptyset, \{1,2\}, \{3,4\}, \Omega\}$ are independent.
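Independence of two finite families of sets can be checked directly from the definition. The sketch below assumes the uniform probability measure on a finite $\Omega$, which the last example does not state explicitly; the names `prob` and `independent` are hypothetical.

```python
from fractions import Fraction
from itertools import product

def prob(event, omega):
    """Uniform probability of an event in the finite space omega (an assumption)."""
    return Fraction(len(set(event) & set(omega)), len(omega))

def independent(family_a, family_b, omega):
    """Check P[A n B] = P[A] P[B] for all A in family_a and B in family_b."""
    return all(
        prob(set(a) & set(b), omega) == prob(a, omega) * prob(b, omega)
        for a, b in product(family_a, family_b)
    )
```

With this, the two $\sigma$-algebras on $\{1,2,3,4\}$ from the example pass the test, while a nontrivial $\sigma$-algebra is never independent of itself.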


Properties. (1) If a $\sigma$-algebra $\mathcal{F} \subset \mathcal{A}$ is independent of itself, then
$P[A \cap A] = P[A] = P[A]^2$ so that for every $A \in \mathcal{F}$, $P[A] \in \{0,1\}$. Such a
$\sigma$-algebra is called $P$-trivial.
(2) Two sets $A, B \in \mathcal{A}$ are independent if and only if $P[A \cap B] = P[A] \cdot P[B]$.
(3) If $A, B$ are independent, then $A, B^c$ are independent too.
(4) If $P[B] > 0$, and $A, B$ are independent, then $P[A|B] = P[A]$ because
$P[A|B] = (P[A] \cdot P[B])/P[B] = P[A]$.
(5) For independent sets $A, B$, the $\sigma$-algebras $\mathcal{A} = \{\emptyset, A, A^c, \Omega\}$ and $\mathcal{B} =
\{\emptyset, B, B^c, \Omega\}$ are independent.

Definition. A family $\mathcal{I}$ of subsets of $\Omega$ is called a $\pi$-system, if $\mathcal{I}$ is closed
under intersections: if $A, B$ are in $\mathcal{I}$, then $A \cap B$ is in $\mathcal{I}$. A $\sigma$-additive map
from a $\pi$-system $\mathcal{I}$ to $[0,\infty)$ is called a measure.

Example. 1) The family $\mathcal{I} = \{\emptyset, \{1\}, \{2\}, \{3\}, \{1,2\}, \{2,3\}, \Omega\}$ is a $\pi$-system
on $\Omega = \{1,2,3\}$.
2) The set $\mathcal{I} = \{[a,b) \mid 0 \leq a < b \leq 1\} \cup \{\emptyset\}$ of all half closed intervals is a
$\pi$-system on $\Omega = [0,1]$ because the intersection of two such intervals $[a,b)$
and $[c,d)$ is either empty or again such an interval, namely
$[\max(a,c), \min(b,d))$.


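Definition. We use the notation $A_n \nearrow A$ if $A_n \subset A_{n+1}$ and $\bigcup_n A_n = A$.
Let $\Omega$ be a set. $(\Omega, \mathcal{D})$ is called a Dynkin system if $\mathcal{D}$ is a set of subsets of
$\Omega$ satisfying

(i) $\Omega \in \mathcal{D}$,
(ii) $A, B \in \mathcal{D}, A \subset B \Rightarrow B \setminus A \in \mathcal{D}$,
(iii) $A_n \in \mathcal{D}, A_n \nearrow A \Rightarrow A \in \mathcal{D}$.

Lemma 2.1.2. $(\Omega, \mathcal{A})$ is a $\sigma$-algebra if and only if it is a $\pi$-system and a
Dynkin system.

Proof. If $\mathcal{A}$ is a $\sigma$-algebra, then it certainly is both a $\pi$-system and a Dynkin
system. Assume now, $\mathcal{A}$ is both a $\pi$-system and a Dynkin system. Given
$A, B \in \mathcal{A}$. The Dynkin property implies that $A^c = \Omega \setminus A$, $B^c = \Omega \setminus B$ are
in $\mathcal{A}$ and by the $\pi$-system property also $A \cup B = \Omega \setminus (A^c \cap B^c) \in \mathcal{A}$. Given
a sequence $A_n \in \mathcal{A}$. Define $B_n = \bigcup_{k=1}^n A_k \in \mathcal{A}$ and $A = \bigcup_n A_n$. Then
$B_n \nearrow A$ and by the Dynkin property $A \in \mathcal{A}$. Also $\bigcap_n A_n = \Omega \setminus \bigcup_n A_n^c \in \mathcal{A}$
so that $\mathcal{A}$ is a $\sigma$-algebra. □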


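Definition. If $\mathcal{I}$ is any set of subsets of $\Omega$, we denote by $d(\mathcal{I})$ the smallest
Dynkin system which contains $\mathcal{I}$ and call it the Dynkin system generated
by $\mathcal{I}$.

Lemma 2.1.3. If $\mathcal{I}$ is a $\pi$-system, then $d(\mathcal{I}) = \sigma(\mathcal{I})$.

Proof. By the previous lemma, we need only to show that $d(\mathcal{I})$ is a
$\pi$-system.

(i) Define $\mathcal{D}_1 = \{B \in d(\mathcal{I}) \mid B \cap C \in d(\mathcal{I}), \forall C \in \mathcal{I}\}$. Because $\mathcal{I}$ is a
$\pi$-system, we have $\mathcal{I} \subset \mathcal{D}_1$.

Claim. $\mathcal{D}_1$ is a Dynkin system.

Proof. Clearly $\Omega \in \mathcal{D}_1$. Given $A, B \in \mathcal{D}_1$ with $A \subset B$. For $C \in \mathcal{I}$
we compute $(B \setminus A) \cap C = (B \cap C) \setminus (A \cap C)$ which is in $d(\mathcal{I})$. Therefore
$B \setminus A \in \mathcal{D}_1$. Given $A_n \nearrow A$ with $A_n \in \mathcal{D}_1$ and $C \in \mathcal{I}$. Then $A_n \cap C \nearrow A \cap C$
so that $A \cap C \in d(\mathcal{I})$ and $A \in \mathcal{D}_1$. Since $\mathcal{D}_1$ is a Dynkin system with
$\mathcal{I} \subset \mathcal{D}_1 \subset d(\mathcal{I})$, we have $\mathcal{D}_1 = d(\mathcal{I})$.

(ii) Define $\mathcal{D}_2 = \{A \in d(\mathcal{I}) \mid B \cap A \in d(\mathcal{I}), \forall B \in d(\mathcal{I})\}$. From (i) we know
that $\mathcal{I} \subset \mathcal{D}_2$. Like in (i), we show that $\mathcal{D}_2$ is a Dynkin system. Therefore
$\mathcal{D}_2 = d(\mathcal{I})$, which means that $d(\mathcal{I})$ is a $\pi$-system. □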


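Lemma 2.1.4. (Extension lemma) Given a $\pi$-system $\mathcal{I}$. If two measures $\mu, \nu$
on $\sigma(\mathcal{I})$ satisfy $\mu(\Omega) = \nu(\Omega) < \infty$ and $\mu(A) = \nu(A)$ for $A \in \mathcal{I}$, then $\mu = \nu$.

Proof. The set $\mathcal{D} = \{A \in \sigma(\mathcal{I}) \mid \mu(A) = \nu(A)\}$
is a Dynkin system: first of all $\Omega \in \mathcal{D}$. Given $A, B \in \mathcal{D}$, $A \subset B$. Then
$\mu(B \setminus A) = \mu(B) - \mu(A) = \nu(B) - \nu(A) = \nu(B \setminus A)$ so that $B \setminus A \in \mathcal{D}$. Given
$A_n \in \mathcal{D}$ with $A_n \nearrow A$, then the $\sigma$-additivity gives $\mu(A) = \limsup_n \mu(A_n) =
\limsup_n \nu(A_n) = \nu(A)$, so that $A \in \mathcal{D}$. Since $\mathcal{D}$ is a Dynkin system
containing the $\pi$-system $\mathcal{I}$, we know that $\sigma(\mathcal{I}) = d(\mathcal{I}) \subset \mathcal{D}$ which means that
$\mu = \nu$ on $\sigma(\mathcal{I})$. □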


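Definition. Given a probability space $(\Omega, \mathcal{A}, P)$. Two $\pi$-systems $\mathcal{I}, \mathcal{J} \subset \mathcal{A}$
are called P-independent, if for all $A \in \mathcal{I}$ and $B \in \mathcal{J}$, $P[A \cap B] = P[A] \cdot P[B]$.

Lemma 2.1.5. Given a probability space $(\Omega, \mathcal{A}, P)$. Let $\mathcal{G}, \mathcal{H}$ be two
$\sigma$-subalgebras of $\mathcal{A}$ and $\mathcal{I}$ and $\mathcal{J}$ be two $\pi$-systems satisfying $\sigma(\mathcal{I}) = \mathcal{G}$,
$\sigma(\mathcal{J}) = \mathcal{H}$. Then $\mathcal{G}$ and $\mathcal{H}$ are independent if and only if $\mathcal{I}$ and $\mathcal{J}$ are
independent.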


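Proof. (i) Fix $I \in \mathcal{I}$ and define on $(\Omega, \mathcal{H})$ the measures $\mu(H) = P[I \cap H]$,
$\nu(H) = P[I] P[H]$ of total probability $P[I]$. By the independence of $\mathcal{I}$
and $\mathcal{J}$, they coincide on $\mathcal{J}$ and by the extension lemma (2.1.4), they agree
on $\mathcal{H}$ and we have $P[I \cap H] = P[I] P[H]$ for all $I \in \mathcal{I}$ and $H \in \mathcal{H}$.

(ii) Define for fixed $H \in \mathcal{H}$ the measures $\mu(G) = P[G \cap H]$ and $\nu(G) =
P[G] P[H]$ of total probability $P[H]$ on $(\Omega, \mathcal{G})$. They agree on $\mathcal{I}$ and so on $\mathcal{G}$.
We have shown that $P[G \cap H] = P[G] P[H]$ for all $G \in \mathcal{G}$ and all $H \in \mathcal{H}$. □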


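Definition. $\mathcal{A}$ is an algebra if $\mathcal{A}$ is a set of subsets of $\Omega$ satisfying

(i) $\Omega \in \mathcal{A}$,
(ii) $A \in \mathcal{A} \Rightarrow A^c \in \mathcal{A}$,
(iii) $A, B \in \mathcal{A} \Rightarrow A \cup B \in \mathcal{A}$.

A $\sigma$-additive map from $\mathcal{A}$ to $[0, \infty)$ is called a measure.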


Theorem 2.1.6 (Carathéodory continuation theorem). Any measure on an
algebra $\mathcal{R}$ has a unique continuation to a measure on $\sigma(\mathcal{R})$.


Before we launch into the proof of this theorem, we need two lemmas: 


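Definition. Let $\mathcal{A}$ be an algebra and $\lambda : \mathcal{A} \to [0, \infty]$ with $\lambda(\emptyset) = 0$. A set
$A \in \mathcal{A}$ is called a $\lambda$-set, if $\lambda(A \cap G) + \lambda(A^c \cap G) = \lambda(G)$ for all $G \in \mathcal{A}$.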


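Lemma 2.1.7. The set $\mathcal{A}_\lambda$ of $\lambda$-sets of an algebra $\mathcal{A}$ is again an algebra and
satisfies $\sum_{k=1}^n \lambda(A_k \cap G) = \lambda((\bigcup_{k=1}^n A_k) \cap G)$ for all finite disjoint families
$\{A_k\}_{k=1}^n$ in $\mathcal{A}_\lambda$ and all $G \in \mathcal{A}$.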


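Proof. From the definition it is clear that $\Omega \in \mathcal{A}_\lambda$ and that if $B \in \mathcal{A}_\lambda$, then
$B^c \in \mathcal{A}_\lambda$. Given $B, C \in \mathcal{A}_\lambda$. Then $A = B \cap C \in \mathcal{A}_\lambda$. Proof. Since $C \in \mathcal{A}_\lambda$,
we get

$$ \lambda(C \cap A^c \cap G) + \lambda(C^c \cap A^c \cap G) = \lambda(A^c \cap G) . $$

This can be rewritten with $C \cap A^c = C \cap B^c$ and $C^c \cap A^c = C^c$ as

$$ \lambda(A^c \cap G) = \lambda(C \cap B^c \cap G) + \lambda(C^c \cap G) . \quad (2.1) $$

Because $B$ is a $\lambda$-set, we get using $B \cap C = A$

$$ \lambda(A \cap G) + \lambda(B^c \cap C \cap G) = \lambda(C \cap G) . \quad (2.2) $$

Since $C$ is a $\lambda$-set, we have

$$ \lambda(C \cap G) + \lambda(C^c \cap G) = \lambda(G) . \quad (2.3) $$

Adding up these three equations shows that $B \cap C$ is a $\lambda$-set. We have so
verified that $\mathcal{A}_\lambda$ is an algebra. If $B$ and $C$ are disjoint in $\mathcal{A}_\lambda$, we deduce
from the fact that $B$ is a $\lambda$-set

$$ \lambda(B \cap (B \cup C) \cap G) + \lambda(B^c \cap (B \cup C) \cap G) = \lambda((B \cup C) \cap G) . $$

This can be rewritten as $\lambda(B \cap G) + \lambda(C \cap G) = \lambda((B \cup C) \cap G)$. The analogous
statement for finitely many sets is obtained by induction. □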


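Definition. Let $\mathcal{A}$ be a $\sigma$-algebra. A map $\lambda : \mathcal{A} \to [0, \infty]$ is called an outer
measure, if

$\lambda(\emptyset) = 0$,
$A, B \in \mathcal{A}$ with $A \subset B \Rightarrow \lambda(A) \leq \lambda(B)$,
$A_n \in \mathcal{A} \Rightarrow \lambda(\bigcup_n A_n) \leq \sum_n \lambda(A_n)$ ($\sigma$-subadditivity).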


Lemma 2.1.8. (Carathéodory's lemma) If $\lambda$ is an outer measure on a
measurable space $(\Omega, \mathcal{A})$, then $\mathcal{A}_\lambda \subset \mathcal{A}$ defines a $\sigma$-algebra on which $\lambda$ is
countably additive.


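Proof. Given a disjoint sequence $A_n \in \mathcal{A}_\lambda$. We have to show that $A =
\bigcup_n A_n \in \mathcal{A}_\lambda$ and $\lambda(A) = \sum_n \lambda(A_n)$. By the above lemma (2.1.7), $B_n =
\bigcup_{k=1}^n A_k$ is in $\mathcal{A}_\lambda$. By the monotonicity and additivity, we have

$$ \lambda(G) = \lambda(B_n \cap G) + \lambda(B_n^c \cap G) \geq \lambda(B_n \cap G) + \lambda(A^c \cap G) = \sum_{k=1}^n \lambda(A_k \cap G) + \lambda(A^c \cap G) . $$

Letting $n \to \infty$ and using the $\sigma$-subadditivity gives

$$ \lambda(G) \geq \sum_{k=1}^\infty \lambda(A_k \cap G) + \lambda(A^c \cap G) \geq \lambda(A \cap G) + \lambda(A^c \cap G) . $$

Subadditivity for $\lambda$ gives $\lambda(G) \leq \lambda(A \cap G) + \lambda(A^c \cap G)$. All the inequalities
in this proof are therefore equalities. We conclude that $A \in \mathcal{A}_\lambda$ and, with $G = A$,
that $\lambda$ is $\sigma$-additive on $\mathcal{A}_\lambda$. □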


We now prove Carathéodory's continuation theorem (2.1.6):

Proof. Given an algebra $\mathcal{R}$ with a measure $\mu$. Define $\mathcal{A} = \sigma(\mathcal{R})$ and the
$\sigma$-algebra $\mathcal{P}$ consisting of all subsets of $\Omega$. Define on $\mathcal{P}$ the function

$$ \lambda(A) = \inf \{ \sum_{n \in \mathbb{N}} \mu(A_n) \mid \{A_n\}_{n \in \mathbb{N}} \text{ sequence in } \mathcal{R} \text{ satisfying } A \subset \bigcup_n A_n \} . $$


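(i) $\lambda$ is an outer measure on $\mathcal{P}$.

$\lambda(\emptyset) = 0$ and $\lambda(A) \geq \lambda(B)$ for $B \subset A$ are obvious. To see the
$\sigma$-subadditivity, take a sequence $A_n \in \mathcal{P}$ with $\lambda(A_n) < \infty$ and fix $\epsilon > 0$. For all
$n \in \mathbb{N}$, one can (by the definition of $\lambda$) find a sequence $\{B_{n,k}\}_{k \in \mathbb{N}}$ in $\mathcal{R}$
such that $A_n \subset \bigcup_{k \in \mathbb{N}} B_{n,k}$ and $\sum_{k \in \mathbb{N}} \mu(B_{n,k}) \leq \lambda(A_n) + \epsilon 2^{-n}$. Define $A =
\bigcup_{n \in \mathbb{N}} A_n \subset \bigcup_{n,k \in \mathbb{N}} B_{n,k}$, so that $\lambda(A) \leq \sum_{n,k} \mu(B_{n,k}) \leq \sum_n \lambda(A_n) + \epsilon$.
Since $\epsilon$ was arbitrary, the $\sigma$-subadditivity is proven.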


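(ii) $\lambda = \mu$ on $\mathcal{R}$.

Given $A \in \mathcal{R}$. Clearly $\lambda(A) \leq \mu(A)$. Suppose that $A \subset \bigcup_n A_n$, with $A_n \in
\mathcal{R}$. Define a sequence $\{B_n\}_{n \in \mathbb{N}}$ of disjoint sets in $\mathcal{R}$ inductively by $B_1 = A_1$,
$B_n = A_n \cap (\bigcup_{k<n} A_k)^c$ such that $B_n \subset A_n$ and $\bigcup_n B_n = \bigcup_n A_n \supset A$. From
the $\sigma$-additivity of $\mu$ on $\mathcal{R}$ follows

$$ \mu(A) \leq \sum_n \mu(B_n) \leq \sum_n \mu(A_n) , $$

so that $\mu(A) \leq \lambda(A)$.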


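(iii) Let $\mathcal{P}_\lambda$ be the set of $\lambda$-sets in $\mathcal{P}$. Then $\mathcal{R} \subset \mathcal{P}_\lambda$.

Given $A \in \mathcal{R}$ and $G \in \mathcal{P}$. There exists a sequence $\{B_n\}_{n \in \mathbb{N}}$ in $\mathcal{R}$ such that
$G \subset \bigcup_n B_n$ and $\sum_n \mu(B_n) \leq \lambda(G) + \epsilon$. By the definition of $\lambda$

$$ \sum_n \mu(B_n) = \sum_n \mu(A \cap B_n) + \sum_n \mu(A^c \cap B_n) \geq \lambda(A \cap G) + \lambda(A^c \cap G) $$

because $A \cap G \subset \bigcup_n A \cap B_n$ and $A^c \cap G \subset \bigcup_n A^c \cap B_n$. Since $\epsilon$ is
arbitrary, we get $\lambda(G) \geq \lambda(A \cap G) + \lambda(A^c \cap G)$. On the other hand, since
$\lambda$ is sub-additive, we have also $\lambda(G) \leq \lambda(A \cap G) + \lambda(A^c \cap G)$ and $A$ is a $\lambda$-set.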


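(iv) By (i) $\lambda$ is an outer measure on $(\Omega, \mathcal{P})$. Since by step (iii), $\mathcal{R} \subset \mathcal{P}_\lambda$,
we know by Carathéodory's lemma that $\mathcal{A} \subset \mathcal{P}_\lambda$, so that we can define $\mu$
on $\mathcal{A}$ as the restriction of $\lambda$ to $\mathcal{A}$. By step (ii), this is an extension of the
measure $\mu$ on $\mathcal{R}$. □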
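The construction in this proof can be followed by brute force on a four-point set. The sketch below assumes a hypothetical algebra $\mathcal{R}$ with the two atoms {0,1} and {2,3}, each of measure 1/2; the names `MU`, `outer` and `is_lambda_set` are chosen only for this illustration. It computes the outer measure $\lambda$ from all covers and lists the $\lambda$-sets.

```python
from fractions import Fraction
from itertools import combinations

OMEGA = frozenset({0, 1, 2, 3})
# Hypothetical algebra R with atoms {0,1}, {2,3} and a measure mu on it.
MU = {
    frozenset(): Fraction(0),
    frozenset({0, 1}): Fraction(1, 2),
    frozenset({2, 3}): Fraction(1, 2),
    OMEGA: Fraction(1),
}

def subsets(items):
    """All subsets of a finite collection, as frozensets."""
    items = list(items)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def outer(a):
    """lambda(A): cheapest total mu-mass of a cover of A by sets of R."""
    costs = [sum(MU[b] for b in cover)
             for cover in subsets(MU)
             if a <= frozenset().union(*cover)]
    return min(costs)

def is_lambda_set(a):
    """Check lambda(A & G) + lambda(G - A) == lambda(G) for every G."""
    return all(outer(a & g) + outer(g - a) == outer(g) for g in subsets(OMEGA))

lambda_sets = {a for a in subsets(OMEGA) if is_lambda_set(a)}
print(lambda_sets == set(MU))                  # lambda-sets versus R
print(all(outer(a) == MU[a] for a in MU))      # lambda versus mu on R
```

In this small example the $\lambda$-sets are exactly the sets of $\mathcal{R}$, which already is a $\sigma$-algebra here, and $\lambda$ agrees with $\mu$ on $\mathcal{R}$ as in step (ii).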


Here is an overview of the possible sets of subsets of $\Omega$ we have considered.
We also include the notions of ring and $\sigma$-ring, which are often used in measure
theory and which differ from the notions of algebra or $\sigma$-algebra in that
$\Omega$ does not have to be in it. In probability theory, those notions are not
needed at first. For an introduction to measure theory, see [3, 37, 18].

Set of subsets     | contains           | closed under
$\pi$-system       |                    | finite intersections
Dynkin system      | $\Omega$           | increasing countable union, difference
ring               | $\emptyset$        | complement and finite unions
$\sigma$-ring      | $\emptyset$        | countably many unions and complement
algebra            | $\Omega$           | complement and finite unions
$\sigma$-algebra   | $\emptyset, \Omega$ | countably many unions and complement
Borel $\sigma$-algebra | $\emptyset, \Omega$ | $\sigma$-algebra generated by the topology


Remark. The name "ring" has its origin in the fact that with the "addition"
$A + B = A \Delta B = (A \cup B) \setminus (A \cap B)$ and "multiplication" $A \cdot B = A \cap B$,
a ring of sets becomes an algebraic ring like the set of integers, in which
rules like $A \cdot (B + C) = A \cdot B + A \cdot C$ hold. The empty set $\emptyset$ is the zero
element because $A \Delta \emptyset = A$ for every set $A$. If the set $\Omega$ is also in the ring,
one has a ring with 1 because the identity $A \cap \Omega = A$ shows that $\Omega$ is the
1-element in the ring.
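The ring rules of this remark are easy to confirm by enumeration. Here is a minimal sketch on the power set of a three-point set, using Python's built-in `frozenset` operators `^` (symmetric difference) and `&` (intersection); the helper names `add` and `mul` are illustrative.

```python
from itertools import combinations

OMEGA = frozenset({1, 2, 3})
SETS = [frozenset(c) for r in range(len(OMEGA) + 1)
        for c in combinations(sorted(OMEGA), r)]

def add(a, b):
    """'Addition' in the ring of sets: the symmetric difference."""
    return a ^ b

def mul(a, b):
    """'Multiplication' in the ring of sets: the intersection."""
    return a & b

distributive = all(mul(a, add(b, c)) == add(mul(a, b), mul(a, c))
                   for a in SETS for b in SETS for c in SETS)
has_zero = all(add(a, frozenset()) == a for a in SETS)
has_one = all(mul(a, OMEGA) == a for a in SETS)
own_inverse = all(add(a, a) == frozenset() for a in SETS)
print(distributive, has_zero, has_one, own_inverse)
```

The last check shows a feature not shared with the integers: every set is its own additive inverse.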

Let's add some definitions which will occur later:


Definition. A nonzero measure $\mu$ on a measurable space $(\Omega, \mathcal{A})$ is called
positive, if $\mu(A) \geq 0$ for all $A \in \mathcal{A}$. If $\mu^+, \mu^-$ are two positive measures
so that $\mu(A) = \mu^+(A) - \mu^-(A)$ then this is called the Hahn decomposition of $\mu$.
A measure is called finite if it has a Hahn decomposition and the positive
measure $|\mu|$ defined by $|\mu|(A) = \mu^+(A) + \mu^-(A)$ satisfies $|\mu|(\Omega) < \infty$.


Definition. Let $\nu$ be a measure on the measurable space $(\Omega, \mathcal{A})$. We write
$\nu \ll \mu$ if for every $A$ in the $\sigma$-algebra $\mathcal{A}$, the condition $\mu(A) = 0$ implies
$\nu(A) = 0$. One says that $\nu$ is absolutely continuous with respect to $\mu$.


2.2 Kolmogorov’s 0 — 1 law, Borel-Cantelli lemma 


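Definition. Given a family $\{\mathcal{A}_i\}_{i \in I}$ of $\sigma$-subalgebras of $\mathcal{A}$. For any nonempty
set $J \subset I$, let $\mathcal{A}_J := \bigvee_{j \in J} \mathcal{A}_j$ be the $\sigma$-algebra generated by $\bigcup_{j \in J} \mathcal{A}_j$.
Define also $\mathcal{A}_\emptyset = \{\emptyset, \Omega\}$. The tail $\sigma$-algebra $\mathcal{T}$ of $\{\mathcal{A}_i\}_{i \in I}$ is defined as
$\mathcal{T} = \bigcap_{J \subset_f I} \mathcal{A}_{J^c}$, where $J^c = I \setminus J$ and $J \subset_f I$ runs over the finite subsets of $I$.


Theorem 2.2.1 (Kolmogorov's 0-1 law). If $\{\mathcal{A}_i\}_{i \in I}$ are independent
$\sigma$-algebras, then the tail $\sigma$-algebra $\mathcal{T}$ is P-trivial: $P[A] = 0$ or $P[A] = 1$ for
every $A \in \mathcal{T}$.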


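Proof. (i) The algebras $\mathcal{A}_F$ and $\mathcal{A}_G$ are independent, whenever $F, G \subset I$
are disjoint.

Proof. Define for $H \subset I$ the $\pi$-system

$$ \mathcal{I}_H = \{ A \in \mathcal{A} \mid A = \bigcap_{i \in K} A_i, \; K \subset_f H, \; A_i \in \mathcal{A}_i \} . $$

The $\pi$-systems $\mathcal{I}_F$ and $\mathcal{I}_G$ are independent and generate the $\sigma$-algebras $\mathcal{A}_F$
and $\mathcal{A}_G$. Use lemma (2.1.5).

(ii) Especially: $\mathcal{A}_J$ is independent of $\mathcal{A}_{J^c}$ for every $J \subset I$.

(iii) $\mathcal{T}$ is independent of $\mathcal{A}_I$.

Proof. $\mathcal{T} = \bigcap_{J \subset_f I} \mathcal{A}_{J^c}$ is independent of any $\mathcal{A}_K$ for $K \subset_f I$. It is
therefore independent of the $\pi$-system $\mathcal{I}_I$ which generates $\mathcal{A}_I$. Use again
lemma (2.1.5).

(iv) $\mathcal{T}$ is a sub-$\sigma$-algebra of $\mathcal{A}_I$. Therefore $\mathcal{T}$ is independent of itself: every
$A \in \mathcal{T}$ satisfies $P[A] = P[A \cap A] = P[A]^2$, so that $P[A] \in \{0, 1\}$ and $\mathcal{T}$ is
P-trivial. □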


Example. Let $X_n$ be a sequence of independent random variables and let

$$ A = \{ \omega \in \Omega \mid \sum_{n=1}^\infty X_n(\omega) \text{ converges} \} . $$

Then $P[A] = 0$ or $P[A] = 1$. Proof. Because $\sum_{n=1}^\infty X_n$ converges if and
only if $\sum_{n=N}^\infty X_n$ converges, $A \in \sigma(\mathcal{A}_N, \mathcal{A}_{N+1}, \dots)$ for every $N$ and so $A \in \mathcal{T}$, the
tail $\sigma$-algebra defined by the independent $\sigma$-algebras $\mathcal{A}_n = \sigma(X_n)$. If for
example, $X_n$ takes values $\pm 1/\sqrt{n}$, each with probability 1/2, then $P[A] = 0$.
If $X_n$ takes values $\pm 1/n^2$, each with probability 1/2, then $P[A] = 1$. As you
might guess, the decision whether $P[A] = 0$ or $P[A] = 1$ is related to the
convergence or divergence of a series, here the series of the variances of $X_n$.
We will come back to that later.


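Example. Let $\{A_n\}_{n \in \mathbb{N}}$ be a sequence of subsets of $\Omega$. The set

$$ A_\infty := \limsup_{n \to \infty} A_n = \bigcap_{m=1}^\infty \bigcup_{n \geq m} A_n $$

consists of all $\omega \in \Omega$ such that $\omega \in A_n$ for infinitely many $n \in \mathbb{N}$. The
set $A_\infty$ is contained in the tail $\sigma$-algebra of $\mathcal{A}_n = \{\emptyset, A_n, A_n^c, \Omega\}$. It follows
from Kolmogorov's 0-1 law that $P[A_\infty] \in \{0, 1\}$ if $A_n \in \mathcal{A}$ and $\{A_n\}$ are
P-independent.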


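Remark. In the theory of dynamical systems, a measurable map $T : \Omega \to \Omega$
of a probability space $(\Omega, \mathcal{A}, P)$ onto itself is called a K-system, if there
exists a $\sigma$-subalgebra $\mathcal{F} \subset \mathcal{A}$ which satisfies $\mathcal{F} \subset \sigma(T(\mathcal{F}))$, for which the
sequence $\mathcal{F}_n = \sigma(T^n(\mathcal{F}))$ satisfies $\mathcal{F}_{\mathbb{N}} = \mathcal{A}$ and which has a trivial tail
$\sigma$-algebra $\mathcal{T} = \{\emptyset, \Omega\}$. An example of such a system is a shift map $T(x)_n =
x_{n+1}$ on $\Omega = \Delta^{\mathbb{N}}$, where $\Delta$ is a compact topological space. The K-system
property follows from Kolmogorov's 0-1 law: take $\mathcal{F} = \bigvee_{k=1}^\infty T^k(\mathcal{F}_0)$, with
$\mathcal{F}_0 = \{ x \in \Omega = \Delta^{\mathbb{N}} \mid x_0 = r \in \Delta \}$.


Theorem 2.2.2 (First Borel-Cantelli lemma). Given a sequence of events
$A_n \in \mathcal{A}$. Then

$$ \sum_{n \in \mathbb{N}} P[A_n] < \infty \Rightarrow P[A_\infty] = 0 . $$


Proof. $P[A_\infty] = \lim_{n \to \infty} P[\bigcup_{k \geq n} A_k] \leq \lim_{n \to \infty} \sum_{k \geq n} P[A_k] = 0$. □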


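Theorem 2.2.3 (Second Borel-Cantelli lemma). For a sequence $A_n \in \mathcal{A}$ of
independent events,

$$ \sum_{n \in \mathbb{N}} P[A_n] = \infty \Rightarrow P[A_\infty] = 1 . $$


Proof. For every integer $n \in \mathbb{N}$,

$$ P[\bigcap_{k \geq n} A_k^c] = \prod_{k \geq n} P[A_k^c] = \prod_{k \geq n} (1 - P[A_k]) \leq \prod_{k \geq n} \exp(-P[A_k]) = \exp(-\sum_{k \geq n} P[A_k]) . $$

The right hand side is 0 for every $n$ because the tail sums $\sum_{k \geq n} P[A_k]$ are
infinite. From

$$ P[A_\infty^c] = P[\bigcup_{n \in \mathbb{N}} \bigcap_{k \geq n} A_k^c] \leq \sum_{n \in \mathbb{N}} P[\bigcap_{k \geq n} A_k^c] = 0 $$

follows $P[A_\infty^c] = 0$. □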
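The key estimate $\prod (1 - p_k) \leq \exp(-\sum p_k)$ in this proof can be looked at numerically. The following sketch uses finite ranges $n \leq k \leq m$ and two assumed example sequences, $p_k = 1/k$ (divergent sum) and $p_k = 1/k^2$ (convergent sum); the function name `tail_bounds` is illustrative.

```python
import math

def tail_bounds(p, n, m):
    """Return (prod_{k=n}^m (1 - p(k)), exp(-sum_{k=n}^m p(k)))."""
    prod, total = 1.0, 0.0
    for k in range(n, m + 1):
        prod *= 1.0 - p(k)
        total += p(k)
    return prod, math.exp(-total)

# Divergent case p_k = 1/k: the product telescopes to (n-1)/m and tends to 0.
for m in (10, 1000, 100000):
    print(m, tail_bounds(lambda k: 1.0 / k, 2, m))

# Summable case p_k = 1/k^2: the product stays bounded away from 0,
# so this argument gives no conclusion (the first lemma applies instead).
print(tail_bounds(lambda k: 1.0 / k ** 2, 2, 100000))
```

In the divergent case both numbers shrink to 0 as $m$ grows, which is what forces $P[\bigcap_{k \geq n} A_k^c] = 0$.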


Example. The following example illustrates that independence is necessary
in the second Borel-Cantelli lemma: take the probability space $([0, 1], \mathcal{B}, P)$,
where $P = dx$ is the Lebesgue measure on the Borel $\sigma$-algebra $\mathcal{B}$ of $[0, 1]$.
For $A_n = [0, 1/n]$ we get $A_\infty = \{0\}$ and so $P[A_\infty] = 0$. But because $P[A_n] =
1/n$ we have $\sum_{n=1}^\infty P[A_n] = \sum_{n=1}^\infty \frac{1}{n} = \infty$ because the harmonic series
$\sum_{n=1}^\infty 1/n$ diverges:

$$ \sum_{n=1}^R \frac{1}{n} \geq \int_1^R \frac{1}{x} \, dx = \log(R) . $$



Example. ("Monkey typing Shakespeare") Writing a novel amounts to
entering a sequence of $N$ symbols into a computer. For example, to write
"Hamlet", Shakespeare had to enter $N = 180'000$ characters. A monkey is placed
in front of a terminal and types symbols at random, one per unit time,
producing a random sequence $X_n$ of identically distributed random
variables in the set of all possible symbols. If each letter occurs with
probability at least $\epsilon$, then the probability that Hamlet appears when typing
the first $N$ letters is at least $\epsilon^N$. Call $A_1$ this event and call $A_k$ the event that
this happens when typing the $(k-1)N + 1$'th until the $kN$'th letter. These
sets $A_k$ are all independent and have all equal probability. By the second
Borel-Cantelli lemma, the events occur infinitely often. This means that
Shakespeare's work is not only written once, but infinitely many times.
Before we model this precisely, let's look at the odds for random typing. There
are $30^N$ possibilities to write a word of length $N$ with 26 letters together
with a minimal set of punctuation: a space, a comma, a dash and a period
sign. The chance to write "To be, or not to be - that is the question."
with 43 random hits onto the keyboard is $1/10^{63.5}$. Note that the life time
of a monkey is bounded above by $131400000 \approx 10^8$ seconds so that it is
even unlikely that this single sentence will ever be typed. To compare the
probability, it is helpful to put the result into a list of known
large numbers [10, 38].
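The odds quoted for the single sentence can be recomputed in a few lines. This sketch assumes the 30-symbol alphabet described in the text (26 letters plus space, comma, dash and period) and ignores the capital letter; the variable names are illustrative.

```python
import math

sentence = "To be, or not to be - that is the question."
alphabet_size = 26 + 4          # letters plus space, comma, dash, period

# Number of possible strings of this length is alphabet_size ** len(sentence);
# work with its base-10 logarithm to keep the numbers readable.
log10_count = len(sentence) * math.log10(alphabet_size)
print(len(sentence), round(log10_count, 1))

# Keystrokes available in the quoted monkey lifetime, one per second.
log10_lifetime = math.log10(131400000)
print(round(log10_lifetime, 1))
```

The gap of more than 55 orders of magnitude between the two logarithms is what makes the single sentence so unlikely within one lifetime.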


$10^4$ One "myriad". The largest numbers the Greeks were considering.
$10^5$ The largest number considered by the Romans.
$10^{10}$ The age of the universe in years.
$10^{17}$ The age of the universe in seconds.
$10^{22}$ Distance to our neighbor galaxy Andromeda in meters.
$10^{23}$ Number of atoms in two gram Carbon which is 1 Avogadro.
$10^{27}$ Estimated size of universe in meters.
$10^{30}$ Mass of the sun in kilograms.
$10^{41}$ Mass of our home galaxy "milky way" in kilograms.
$10^{51}$ Archimedes's estimate of number of sand grains in universe.
$10^{80}$ The number of protons in the universe.
$10^{100}$ One "googol". (Name coined by 9 year old nephew of E. Kasner).
$10^{153}$ Number mentioned in a myth about Buddha.
$10^{155}$ Size of ninth Fermat number (factored in 1990).
$10^{10^6}$ Size of large prime number (Mersenne number, Nov 1996).
$10^{10^7}$ Years, ape needs to write "hound of Baskerville" (random typing).
$10^{10^{33}}$ Inverse is chance that a can of beer tips by quantum fluctuation.
$10^{10^{42}}$ Inverse is probability that a mouse survives on the sun for a week.
$10^{10^{50}}$ Estimated number of possible games of chess.
$10^{10^{51}}$ Inverse is chance to find yourself on Mars by quantum fluctuations.
$10^{10^{100}}$ One "Googolplex".


Lemma 2.2.4. Given a random variable $X$ on a finite probability space $\Delta$,
there exists a sequence $X_1, X_2, \dots$ of independent random variables for
which all random variables $X_i$ have the same distribution as $X$.


Proof. The product space $\Omega = \Delta^{\mathbb{N}}$ is compact by Tychonov's theorem. Let
$\mathcal{A}$ be the Borel $\sigma$-algebra on $\Omega$ and let $Q$ denote the probability measure on
$\Delta$. The probability measure $P = Q^{\mathbb{N}}$ is defined on $(\Omega, \mathcal{A})$ and has the property
that for any cylinder set

$$ Z(w) = \{ \omega \in \Omega \mid \omega_k = r_k, \omega_{k+1} = r_{k+1}, \dots, \omega_n = r_n \} $$

defined by a "word" $w = [r_k, \dots, r_n]$,

$$ P[Z(w)] = \prod_{i=k}^n P[\omega_i = r_i] = \prod_{i=k}^n Q(\{r_i\}) . $$

Finite unions of cylinder sets form an algebra $\mathcal{R}$ which generates $\sigma(\mathcal{R}) = \mathcal{A}$.
The measure $P$ is $\sigma$-additive on this algebra. By Carathéodory's
continuation theorem (2.1.6), there exists a measure $P$ on $(\Omega, \mathcal{A})$. For this
probability space $(\Omega, \mathcal{A}, P)$, the random variables $X_i(\omega) = \omega_i$ are independent
and have the same distribution as $X$. □


Example. In the example of the monkey writing a novel, the process of
authoring is given by a sequence of independent random variables $X_n(\omega) =
\omega_n$. The event that Hamlet is written during the time $[Nk + 1, N(k + 1)]$
is given by a cylinder set $A_k$. They have all the same probability. By the
second Borel-Cantelli lemma, $P[A_\infty] = 1$. The set $A_\infty$, the event that the
monkey types this novel arbitrarily often, has probability 1.


Remark. Lemma (2.2.4) can be generalized: given any sequence of
probability spaces $(\mathbb{R}, \mathcal{B}, P_i)$ one can form the product space $(\Omega, \mathcal{A}, P)$. The
random variables $X_i(\omega) = \omega_i$ are independent and have the law $P_i$. Another
construction of independent random variables is given in [105].


Exercise. In this exercise, we experiment with some measures on $\Omega = \mathbb{N}$
[108].

a) The distance $d(n, m) = |n - m|$ defines a topology $\mathcal{O}$ on $\Omega = \mathbb{N}$. What
is the Borel $\sigma$-algebra $\mathcal{A}$ generated by this topology?

b) Show that for every $\lambda > 0$

$$ P[A] = \sum_{n \in A} e^{-\lambda} \frac{\lambda^n}{n!} $$

is a probability measure on the measurable space $(\Omega, \mathcal{A})$ considered in a).
c) Show that for every $s > 1$

$$ P[A] = \sum_{n \in A} \zeta(s)^{-1} n^{-s} $$

is a probability measure on the measurable space $(\Omega, \mathcal{A})$. The function

$$ \zeta(s) = \sum_{n \in \mathbb{N}} \frac{1}{n^s} $$

is called the Riemann zeta function.

d) Show that the sets $A_p = \{ n \in \Omega \mid p \text{ divides } n \}$ with prime $p$ are
independent. What happens if $p$ is not prime?

e) Give a probabilistic proof of Euler's formula

$$ \frac{1}{\zeta(s)} = \prod_{p \text{ prime}} (1 - \frac{1}{p^s}) . $$

f) Let $A$ be the set of natural numbers which are not divisible by a square
different from 1. Prove

$$ P[A] = \frac{1}{\zeta(2s)} . $$

2.3 Integration, Expectation, Variance 


In this entire section, $(\Omega, \mathcal{A}, P)$ will denote a fixed probability space.


Definition. A statement $S$ about points $\omega \in \Omega$ is a map from $\Omega$ to {true, false}.
A statement is said to hold almost everywhere, if $P[\{ \omega \mid S(\omega) =
\text{false} \}] = 0$. For example, the statement "let $X_n \to X$ almost everywhere"
is a short hand notation for the statement that the set $\{ x \in \Omega \mid X_n(x) \to
X(x) \}$ is measurable and has measure 1.


Definition. The algebra of all random variables is denoted by $\mathcal{L}$. It is a
vector space over the field $\mathbb{R}$ of the real numbers in which one can multiply.
An elementary function or step function is an element of $\mathcal{L}$ which is of the
form

$$ X = \sum_{i=1}^n \alpha_i \cdot 1_{A_i} $$

with $\alpha_i \in \mathbb{R}$ and where $A_i \in \mathcal{A}$ are disjoint sets. Denote by $\mathcal{S}$ the algebra
of step functions. For $X \in \mathcal{S}$ we can define the integral

$$ E[X] := \int_\Omega X \, dP = \sum_{i=1}^n \alpha_i P[A_i] . $$


Definition. Define $\mathcal{L}^1 \subset \mathcal{L}$ as the set of random variables $X$, for which

$$ \sup_{Y \in \mathcal{S}, Y \leq |X|} \int Y \, dP $$

is finite. For $X \in \mathcal{L}^1$, we can define the integral or expectation

$$ E[X] := \int X \, dP = \sup_{Y \in \mathcal{S}, Y \leq X^+} \int Y \, dP - \sup_{Y \in \mathcal{S}, Y \leq X^-} \int Y \, dP , $$

where $X^+ = X \vee 0 = \max(X, 0)$ and $X^- = -X \vee 0 = \max(-X, 0)$. The
vector space $\mathcal{L}^1$ is called the space of integrable random variables. Similarly,
for $p \geq 1$ write $\mathcal{L}^p$ for the set of random variables $X$ for which $E[|X|^p] < \infty$.


Definition. It is customary to write $L^1$ for the space $\mathcal{L}^1$, where random
variables $X, Y$ for which $E[|X - Y|] = 0$ are identified. Unlike $\mathcal{L}^p$, the spaces
$L^p$ are Banach spaces. We will come back to this later.


Definition. For $X \in \mathcal{L}^2$, we can define the variance

$$ \mathrm{Var}[X] := E[(X - E[X])^2] = E[X^2] - E[X]^2 . $$

The nonnegative number

$$ \sigma[X] = \mathrm{Var}[X]^{1/2} $$

is called the standard deviation of $X$.


The names expectation and standard deviation pretty much describe
already the meaning of these numbers. The expectation is the "average",
"mean" or "expected" value of the variable and the standard deviation
measures how much we can expect the variable to deviate from the mean.


Example. The m'th power random variable $X(x) = x^m$ on $([0, 1], \mathcal{B}, P)$ has
the expectation

$$ E[X] = \int_0^1 x^m \, dx = \frac{1}{m+1} , $$

the variance

$$ \mathrm{Var}[X] = E[X^2] - E[X]^2 = \frac{1}{2m+1} - \frac{1}{(m+1)^2} = \frac{m^2}{(1+m)^2 (1+2m)} $$

and the standard deviation $\sigma[X] = \frac{m}{(1+m) \sqrt{1+2m}}$. Both the expectation
as well as the standard deviation converge to 0 if $m \to \infty$.
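The closed formula for the variance in this example can be double-checked with exact rational arithmetic. A minimal sketch, using only the fact $\int_0^1 x^k \, dx = 1/(k+1)$; the function names `moment` and `variance` are illustrative.

```python
from fractions import Fraction

def moment(m, k=1):
    """E[X^k] for X(x) = x^m on [0,1] with Lebesgue measure."""
    return Fraction(1, m * k + 1)

def variance(m):
    """Var[X] = E[X^2] - E[X]^2 for X(x) = x^m."""
    return moment(m, 2) - moment(m, 1) ** 2

for m in (1, 2, 5, 50):
    closed_form = Fraction(m * m, (1 + m) ** 2 * (1 + 2 * m))
    print(m, moment(m), variance(m), variance(m) == closed_form)
```

For $m = 1$ this gives the familiar variance 1/12 of the uniform distribution on $[0, 1]$.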


Definition. If $X$ is a random variable, then $E[X^m]$ is called the m'th
moment of $X$. The m'th central moment of $X$ is defined as $E[(X - E[X])^m]$.


Definition. The moment generating function of $X$ is defined as $M_X(t) =
E[e^{tX}]$. The moment generating function often allows a fast simultaneous
computation of all the moments. The function

$$ \kappa_X(t) = \log(M_X(t)) $$

is called the cumulant generating function.


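Example. For $X(x) = x$ on $[0, 1]$ we have both

$$ M_X(t) = \int_0^1 e^{tx} \, dx = \frac{e^t - 1}{t} = \sum_{m=1}^\infty \frac{t^{m-1}}{m!} = \sum_{m=0}^\infty \frac{t^m}{(m+1)!} $$

and

$$ M_X(t) = E[e^{tX}] = E[\sum_{m=0}^\infty \frac{t^m X^m}{m!}] = \sum_{m=0}^\infty t^m \frac{E[X^m]}{m!} . $$

Comparing coefficients shows $E[X^m] = 1/(m + 1)$.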


Example. Let $\Omega = \mathbb{R}$. For given $m \in \mathbb{R}$, $\sigma > 0$, define the probability
measure $P[[a, b]] = \int_a^b f(x) \, dx$ with

$$ f(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x-m)^2}{2 \sigma^2}} . $$

This is a probability measure because after a change of variables $y =
(x - m)/(\sqrt{2} \sigma)$, the integral $\int_{-\infty}^\infty f(x) \, dx$ becomes $\frac{1}{\sqrt{\pi}} \int_{-\infty}^\infty e^{-y^2} \, dy = 1$. The
random variable $X(x) = x$ on $(\Omega, \mathcal{A}, P)$ is a random variable with Gaussian
distribution with mean $m$ and standard deviation $\sigma$. One simply calls it a
Gaussian random variable or random variable with normal distribution. Let's
justify the constants $m$ and $\sigma$: the expectation of $X$ is $E[X] = \int X \, dP =
\int_{-\infty}^\infty x f(x) \, dx = m$. The variance is $E[(X - m)^2] = \int_{-\infty}^\infty (x - m)^2 f(x) \, dx = \sigma^2$
so that the constant $\sigma$ is indeed the standard deviation. The moment
generating function of $X$ is $M_X(t) = e^{mt + \sigma^2 t^2 / 2}$. The cumulant generating
function is therefore $\kappa_X(t) = mt + \sigma^2 t^2 / 2$.
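The constants in this example can be confirmed by numerical integration. The sketch below uses a plain midpoint rule on the assumed truncated range $m \pm 12\sigma$, which is accurate enough for Gaussian integrands; the helper names and the test values $m = 1$, $\sigma = 2$, $t = 1/2$ are chosen only for illustration.

```python
import math

def gaussian_density(x, m, sigma):
    """Density f(x) of the normal distribution with mean m, deviation sigma."""
    return math.exp(-(x - m) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def integrate(f, lo, hi, steps=20_000):
    """Midpoint rule for the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * h) for i in range(steps)) * h

def gaussian_expectation(g, m, sigma):
    """E[g(X)] for a Gaussian X, truncating the integral to m +- 12 sigma."""
    return integrate(lambda x: g(x) * gaussian_density(x, m, sigma),
                     m - 12 * sigma, m + 12 * sigma)

m, sigma, t = 1.0, 2.0, 0.5
mean = gaussian_expectation(lambda x: x, m, sigma)
var = gaussian_expectation(lambda x: (x - m) ** 2, m, sigma)
mgf = gaussian_expectation(lambda x: math.exp(t * x), m, sigma)
print(mean, var)                                       # close to m and sigma^2
print(mgf, math.exp(m * t + sigma ** 2 * t ** 2 / 2))  # both close to e
```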


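Example. If $X$ is a Gaussian random variable with mean $m = 0$ and
standard deviation $\sigma$, then the random variable $Y = e^X$ has the mean
$E[e^X] = e^{\sigma^2 / 2}$. Proof:

$$ \frac{1}{\sqrt{2 \pi \sigma^2}} \int_{-\infty}^\infty e^{y - \frac{y^2}{2 \sigma^2}} \, dy = e^{\sigma^2 / 2} \frac{1}{\sqrt{2 \pi \sigma^2}} \int_{-\infty}^\infty e^{-\frac{(y - \sigma^2)^2}{2 \sigma^2}} \, dy = e^{\sigma^2 / 2} . $$

The random variable $Y$ has the log normal distribution.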


Example. A random variable $X \in \mathcal{L}^2$ with standard deviation $\sigma = 0$ is a
constant random variable. It satisfies $X(\omega) = m$ for almost all $\omega \in \Omega$.


Definition. If $X \in \mathcal{L}^2$ is a random variable with mean $m$ and standard
deviation $\sigma$, then the random variable $Y = (X - m)/\sigma$ has the mean $m = 0$
and standard deviation $\sigma = 1$. Such a random variable is called normalized.
One often only adjusts the mean and calls $X - E[X]$ the centered random
variable.


Exercise. The Rademacher functions $r_n(x)$ are real-valued functions on
$[0, 1]$ defined by

$$ r_n(x) = \begin{cases} 1, & \frac{2k-2}{2^n} \leq x < \frac{2k-1}{2^n} \\ -1, & \frac{2k-1}{2^n} \leq x < \frac{2k}{2^n} \end{cases} \qquad (k = 1, \dots, 2^{n-1}) . $$

They are random variables on the Lebesgue space $([0, 1], \mathcal{A}, P = dx)$.

a) Show that $1 - 2x = \sum_{n=1}^\infty \frac{r_n(x)}{2^n}$. This means that for fixed $x$, the sequence
$r_n(x)$ is the binary expansion of $1 - 2x$.

b) Verify that $r_n(x) = \mathrm{sign}(\sin(2 \pi 2^{n-1} x))$ for almost all $x$.

c) Show that the random variables $r_n(x)$ on $[0, 1]$ are IID random variables
with uniform distribution on $\{-1, 1\}$.

d) Each $r_n(x)$ has the mean $E[r_n] = 0$ and the variance $\mathrm{Var}[r_n] = 1$.
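Parts a) and b) can be tested numerically before attempting a proof. The sketch below assumes the sign convention that $r_n = +1$ on the first half of each dyadic interval of length $2^{1-n}$, which is what parts a) and b) require, and writes $r_n(x) = 1 - 2 d_n(x)$ with $d_n(x)$ the n'th binary digit of $x$; the function names are illustrative.

```python
import math

def rademacher(n, x):
    """r_n(x) = 1 - 2 d_n(x), with d_n(x) the n-th binary digit of x in [0, 1)."""
    digit = math.floor(2 ** n * x) % 2
    return 1 - 2 * digit

def partial_sum(x, terms=50):
    """Partial sum of sum_n r_n(x) / 2^n, which should approach 1 - 2x."""
    return sum(rademacher(n, x) / 2 ** n for n in range(1, terms + 1))

def sign_of_sine(n, x):
    """sign(sin(2 pi 2^(n-1) x)), defined away from the zeros of the sine."""
    return int(math.copysign(1, math.sin(2 * math.pi * 2 ** (n - 1) * x)))

x = 0.3
print(partial_sum(x), 1 - 2 * x)
print([rademacher(n, x) for n in range(1, 9)])
print([sign_of_sine(n, x) for n in range(1, 9)])
```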


Figure. The Rademacher function $r_1(x)$.
Figure. The Rademacher function $r_2(x)$.
Figure. The Rademacher function $r_3(x)$.


2.4 Results from real analysis 


In this section we recall some results of real analysis with their proofs.
In the measure theory or real analysis literature, it is customary to write
$\int f(x) \, d\mu(x)$ instead of $E[X]$ or $f, g, h, \dots$ instead of $X, Y, Z, \dots$, but this
is just a change of vocabulary. What is special about probability theory is
that the measures $\mu$ are probability measures and so finite.


Theorem 2.4.1 (Monotone convergence theorem, Beppo Lévi 1906). Let $X_n$
be a sequence of random variables in $\mathcal{L}^1$ with $0 \leq X_1 \leq X_2 \leq \dots$ and assume
$X = \lim_{n \to \infty} X_n$ converges point wise. If $\sup_n E[X_n] < \infty$, then $X \in \mathcal{L}^1$
and

$$ E[X] = \lim_{n \to \infty} E[X_n] . $$

Proof. Because we can replace $X_n$ by $X_n - X_1$, we can assume $X_n \geq 0$.
Find for each $n$ a monotone sequence of step functions $X_{n,m} \in \mathcal{S}$ with
$X_n = \sup_m X_{n,m}$. Consider the sequence of step functions

$$ Y_n := \sup_{1 \leq k \leq n} X_{k,n} \leq \sup_{1 \leq k \leq n} X_{k,n+1} \leq \sup_{1 \leq k \leq n+1} X_{k,n+1} = Y_{n+1} . $$

Since $Y_n \leq \sup_{m=1}^n X_m = X_n$ also $E[Y_n] \leq E[X_n]$. One checks that
$\sup_n Y_n = X$ implies $\sup_n E[Y_n] = \sup_{Y \in \mathcal{S}, Y \leq X} E[Y]$ and concludes

$$ E[X] = \sup_{Y \in \mathcal{S}, Y \leq X} E[Y] = \sup_n E[Y_n] \leq \sup_n E[X_n] \leq E[\sup_n X_n] = E[X] . $$

We have used the monotonicity $E[X_n] \leq E[X_{n+1}]$ to identify $\sup_n E[X_n]$ with
$\lim_{n \to \infty} E[X_n]$. □


Theorem 2.4.2 (Fatou lemma, 1906). Let $X_n$ be a sequence of random
variables in $\mathcal{L}^1$ with $|X_n| \leq X$ for some $X \in \mathcal{L}^1$. Then

$$ E[\liminf_{n \to \infty} X_n] \leq \liminf_{n \to \infty} E[X_n] \leq \limsup_{n \to \infty} E[X_n] \leq E[\limsup_{n \to \infty} X_n] . $$


Proof. For $p \geq n$, we have

$$ \inf_{m \geq n} X_m \leq X_p \leq \sup_{m \geq n} X_m . $$

Therefore

$$ E[\inf_{m \geq n} X_m] \leq E[X_p] \leq E[\sup_{m \geq n} X_m] . $$

Because $p \geq n$ was arbitrary, we have also

$$ E[\inf_{m \geq n} X_m] \leq \inf_{p \geq n} E[X_p] \leq \sup_{p \geq n} E[X_p] \leq E[\sup_{m \geq n} X_m] . $$

Since $Y_n = \inf_{m \geq n} X_m$ is increasing with $\sup_n E[Y_n] < \infty$ and $Z_n =
\sup_{m \geq n} X_m$ is decreasing with $\inf_n E[Z_n] > -\infty$ we get from the Beppo-Levi
theorem (2.4.1) that $Y = \sup_n Y_n = \liminf_n X_n$ and $Z = \inf_n Z_n =
\limsup_n X_n$ are in $\mathcal{L}^1$ and

$$ E[\liminf_n X_n] = \sup_n E[\inf_{m \geq n} X_m] \leq \sup_n \inf_{m \geq n} E[X_m] = \liminf_n E[X_n] $$
$$ \leq \limsup_n E[X_n] = \inf_n \sup_{m \geq n} E[X_m] \leq \inf_n E[\sup_{m \geq n} X_m] = E[\limsup_n X_n] . $$

□


Theorem 2.4.3 (Lebesgue's dominated convergence theorem, 1902). Let $X_n$
be a sequence in $\mathcal{L}^1$ with $|X_n| \leq Y$ for some $Y \in \mathcal{L}^1$. If $X_n \to X$ almost
everywhere, then $E[X_n] \to E[X]$.


Proof. Since $X = \liminf_n X_n = \limsup_n X_n$ we know that $X \in \mathcal{L}^1$ and
from Fatou lemma (2.4.2)

$$ E[X] = E[\liminf_n X_n] \leq \liminf_n E[X_n] \leq \limsup_n E[X_n] \leq E[\limsup_n X_n] = E[X] . $$

□


A special case of Lebesgue's dominated convergence theorem is when $Y =
K$ is constant. The theorem is then called the bounded dominated
convergence theorem. It says that $E[X_n] \to E[X]$ if $|X_n| \leq K$ and $X_n \to X$ almost
everywhere.
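A numerical experiment shows why the domination assumption matters. The sketch below, on $[0, 1]$ with Lebesgue measure and a simple midpoint rule, compares the bounded sequence $X_n(x) = x^n$ with the assumed counterexample $Z_n = n \cdot 1_{[0, 1/n]}$; both converge to 0 almost everywhere, but only the first is dominated by an integrable function. The names `expectation` and `spike` are illustrative.

```python
def expectation(f, steps=100_000):
    """Midpoint-rule approximation of E[f] = integral of f over [0, 1]."""
    h = 1.0 / steps
    return sum(f((i + 0.5) * h) for i in range(steps)) * h

def spike(n, x):
    """Z_n(x) = n on [0, 1/n] and 0 elsewhere."""
    return float(n) if x <= 1.0 / n else 0.0

for n in (1, 10, 100):
    bounded = expectation(lambda x: x ** n)          # |X_n| <= 1, tends to 0
    unbounded = expectation(lambda x: spike(n, x))   # stays near 1
    print(n, bounded, unbounded)
```

The expectations of $x^n$ tend to $E[0] = 0$ as the theorem predicts, while those of $Z_n$ stay at 1 although $Z_n \to 0$ almost everywhere.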


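Definition. Define also for $p \in [1, \infty)$ the vector spaces $\mathcal{L}^p = \{ X \in \mathcal{L} \mid |X|^p \in
\mathcal{L}^1 \}$ and $\mathcal{L}^\infty = \{ X \in \mathcal{L} \mid \exists K \in \mathbb{R}, |X| \leq K \text{ almost everywhere} \}$.


Example. For $\Omega = [0, 1]$ with the Lebesgue measure $P = dx$ and Borel
$\sigma$-algebra $\mathcal{A}$, look at the random variable $X(x) = x^\alpha$, where $\alpha$ is a real
number. Because $X$ is bounded for $\alpha \geq 0$, we have then $X \in \mathcal{L}^\infty$. For
$\alpha < 0$, the integral $E[|X|^p] = \int_0^1 x^{\alpha p} \, dx$ is finite if and only if $\alpha p > -1$ so
that $X$ is in $\mathcal{L}^p$ whenever $p < -1/\alpha$.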


2.5 Some inequalities


Definition. A function $h : \mathbb{R} \to \mathbb{R}$ is called convex, if there exists for all
$x_0 \in \mathbb{R}$ a linear map $l(x) = ax + b$ such that $l(x_0) = h(x_0)$ and for all $x \in \mathbb{R}$
the inequality $l(x) \leq h(x)$ holds.


Example. $h(x) = x^2$ is convex, $h(x) = e^x$ is convex, $h(x) = x$ is convex.
$h(x) = -x^2$ is not convex, $h(x) = x^3$ is not convex on $\mathbb{R}$ but convex on
$\mathbb{R}^+ = [0, \infty)$.


Figure. The Jensen inequality in the case $\Omega = \{u, v\}$, $P[\{u\}] = P[\{v\}] = 1/2$
and with $X(u) = a$, $X(v) = b$. The function $h$ in this picture is a quadratic
function of the form $h(x) = (x - s)^2 + t$. Marked on the axes are $a$,
$E[X] = (a + b)/2$, $b$ and the values $h(a)$, $h(b)$, $E[h(X)]$ and $h(E[X]) = h((a + b)/2)$.


Theorem 2.5.1 (Jensen inequality). Given $X \in \mathcal{L}^1$. For any convex function
$h : \mathbb{R} \to \mathbb{R}$, we have

$$ E[h(X)] \geq h(E[X]) , $$

where the left hand side can also be infinite.


Proof. Let $l$ be the linear map defined at $x_0 = E[X]$. By the linearity and
monotonicity of the expectation, we get

$$ h(E[X]) = l(E[X]) = E[l(X)] \leq E[h(X)] . $$

□
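Jensen's inequality can be checked exactly in the two-point situation of the figure. In the sketch below the values $a = 1$, $b = 4$, $s = 2$, $t = 1$ are assumed only for illustration, and the helper names are made up; for $h(x) = (x - s)^2 + t$ the gap $E[h(X)] - h(E[X])$ equals $(a - b)^2 / 4$.

```python
from fractions import Fraction

def expectation(values, probs):
    """E[X] on a finite probability space."""
    return sum(v * p for v, p in zip(values, probs))

def jensen_gap(h, values, probs):
    """E[h(X)] - h(E[X]); nonnegative whenever h is convex."""
    return expectation([h(v) for v in values], probs) - h(expectation(values, probs))

values = [Fraction(1), Fraction(4)]          # X(u) = a, X(v) = b
probs = [Fraction(1, 2), Fraction(1, 2)]

convex = lambda x: (x - 2) ** 2 + 1          # h(x) = (x - s)^2 + t
concave = lambda x: -x * x

print(jensen_gap(convex, values, probs))     # (a - b)^2 / 4
print(jensen_gap(concave, values, probs))    # negative: inequality reversed
```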


Example. Given $p \leq q$. Define $h(x) = |x|^{q/p}$. Jensen's inequality gives
$E[|X|^q] = E[h(|X|^p)] \geq h(E[|X|^p]) = E[|X|^p]^{q/p}$. This implies that $\|X\|_q :=
E[|X|^q]^{1/q} \geq E[|X|^p]^{1/p} = \|X\|_p$ for $p \leq q$ and so

$$ \mathcal{L}^\infty \subset \mathcal{L}^p \subset \mathcal{L}^q \subset \mathcal{L}^1 $$

for $p \geq q$. The smallest space is $\mathcal{L}^\infty$ which is the space of all bounded
random variables.


Exercise. Assume $X$ is a nonnegative random variable for which $X$ and
$1/X$ are both in $\mathcal{L}^1$. Show that $E[X + 1/X] \geq 2$.


We have defined $\mathcal{L}^p$ as the set of random variables which satisfy $E[|X|^p] <
\infty$ for $p \in [1, \infty)$ and $|X| \leq K$ almost everywhere for $p = \infty$. The vector
space $\mathcal{L}^p$ has the semi-norm $\|X\|_p = E[|X|^p]^{1/p}$ resp. $\|X\|_\infty = \inf \{ K \in
\mathbb{R} \mid |X| \leq K \text{ almost everywhere} \}$.


Definition. One can construct from $\mathcal{L}^p$ a real Banach space $L^p = \mathcal{L}^p / \mathcal{N}$
which is the quotient of $\mathcal{L}^p$ with $\mathcal{N} = \{ X \in \mathcal{L}^p \mid \|X\|_p = 0 \}$. Without this
identification, one only has a pre-Banach space in which the property that
only the zero element has norm zero is not necessarily true. Especially, for
$p = 2$, the space $L^2$ is a real Hilbert space with inner product $\langle X, Y \rangle =
E[XY]$.


Example. The function $f(x) = 1_{\mathbb{Q}}(x)$ which assigns the value 1 to rational
numbers $x$ on $[0, 1]$ and the value 0 to irrational numbers is different from
the constant function $g(x) = 0$ in $\mathcal{L}^p$. But in $L^p$, we have $f = g$.

The finiteness of the inner product follows from the following inequality:


Theorem 2.5.2 (Hölder inequality, Hölder 1889). Given $p, q \in [1, \infty]$ with
$p^{-1} + q^{-1} = 1$ and $X \in \mathcal{L}^p$ and $Y \in \mathcal{L}^q$. Then $XY \in \mathcal{L}^1$ and

$$ \|XY\|_1 \leq \|X\|_p \|Y\|_q . $$


Proof. Without loss of generality, we can restrict the situation to $X, Y \geq 0$
and $\|X\|_p > 0$. Define the probability measure

$$ Q = \frac{X^p P}{E[X^p]} $$

and define $u = 1_{\{X > 0\}} Y / X^{p-1}$. Jensen's inequality gives $Q(u)^q \leq Q(u^q)$ so
that

$$ E[|XY|] \leq \|X\|_p \|1_{\{X > 0\}} Y\|_q \leq \|X\|_p \|Y\|_q . $$

□


A special case of Hölder's inequality is the Cauchy-Schwarz inequality

$$ \|XY\|_1 \leq \|X\|_2 \cdot \|Y\|_2 . $$


The semi-norm property of $\mathcal{L}^p$ follows from the following fact:


Theorem 2.5.3 (Minkowski inequality (1896)). Given $p \in [1, \infty]$ and $X, Y \in
\mathcal{L}^p$. Then

$$ \|X + Y\|_p \leq \|X\|_p + \|Y\|_p . $$


Proof. We use Hölder's inequality from above to get

$$ E[|X + Y|^p] \leq E[|X| |X + Y|^{p-1}] + E[|Y| |X + Y|^{p-1}] \leq \|X\|_p C + \|Y\|_p C , $$

where $C = \| |X + Y|^{p-1} \|_q = E[|X + Y|^p]^{1/q}$ which leads to the claim. □
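Both inequalities can be spot-checked on a finite probability space, where random variables are just vectors. The sketch below assumes six equally likely points and random test vectors drawn with a fixed seed; the names `norm` and `check` are illustrative, and a small tolerance absorbs floating point rounding.

```python
import random

def norm(x, probs, p):
    """||X||_p = E[|X|^p]^(1/p) on a finite probability space."""
    return sum(abs(v) ** p * w for v, w in zip(x, probs)) ** (1.0 / p)

def check(x, y, probs, p, tol=1e-12):
    """Return (Hoelder holds, Minkowski holds) for exponent p > 1."""
    q = p / (p - 1.0)                     # conjugate exponent: 1/p + 1/q = 1
    xy = [a * b for a, b in zip(x, y)]
    s = [a + b for a, b in zip(x, y)]
    holder = norm(xy, probs, 1) <= norm(x, probs, p) * norm(y, probs, q) + tol
    minkowski = norm(s, probs, p) <= norm(x, probs, p) + norm(y, probs, p) + tol
    return holder, minkowski

rng = random.Random(0)
probs = [1.0 / 6] * 6
results = []
for p in (1.5, 2.0, 3.0, 7.0):
    for _ in range(200):
        x = [rng.uniform(-5, 5) for _ in probs]
        y = [rng.uniform(-5, 5) for _ in probs]
        results.append(check(x, y, probs, p))
print(all(h and m for h, m in results))
```

For $p = 2$ the first check is the Cauchy-Schwarz inequality.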


Definition. We use the short-hand notation $P[X \geq c]$ for $P[\{ \omega \in \Omega \mid X(\omega) \geq
c \}]$.


Theorem 2.5.4 (Chebychev-Markov inequality). Let $h$ be a monotone
function on $\mathbb{R}$ with $h \geq 0$. For every $c > 0$, and $h(X) \in \mathcal{L}^1$ we have

$$ h(c) \cdot P[X \geq c] \leq E[h(X)] . $$


Proof. Integrate the inequality $h(c) 1_{\{X \geq c\}} \leq h(X)$ and use the monotonicity
and linearity of the expectation. □


Figure. The proof of the Chebychev-Markov inequality in the case $h(x) = x$.
The left hand side $h(c) \cdot P[X \geq c]$ is the area of the rectangles
$\{X \geq c\} \times [0, h(c)]$ and $E[h(X)] = E[X]$ is the area under the graph of $X$.


Example. $h(x) = |x|$ leads to $P[|X| \geq c] \leq \|X\|_1 / c$ which implies for
example the statement

$$ E[|X|] = 0 \Rightarrow P[X = 0] = 1 . $$


Exercise. Prove the Chernoff bound

$$ P[X \geq c] \leq \inf_{t \geq 0} e^{-tc} M_X(t) , $$

where $M_X(t) = E[e^{Xt}]$ is the moment generating function of $X$.


An important special case of the Chebychev-Markov inequality is the
Chebychev inequality:


Theorem 2.5.5 (Chebychev inequality). If $X \in \mathcal{L}^2$, then

$$ P[|X - E[X]| \geq c] \leq \frac{\mathrm{Var}[X]}{c^2} . $$


Proof. Take $h(x) = x^2$, which is monotone on $[0, \infty)$, and apply the
Chebychev-Markov inequality to the random variable $|Y|$, where $Y = X - E[X] \in \mathcal{L}^2$
satisfies $h(Y) \in \mathcal{L}^1$. □
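As a concrete instance, the Chebychev bound can be compared with the exact tail probabilities of a fair die. This is only an illustrative sketch; the die and the chosen thresholds are assumptions, and the names `tail` and `bound` are made up.

```python
from fractions import Fraction

faces = range(1, 7)
prob = Fraction(1, 6)
mean = sum(k * prob for k in faces)
var = sum((k - mean) ** 2 * prob for k in faces)

def tail(c):
    """Exact P[|X - E[X]| >= c] for a fair die."""
    return sum(prob for k in faces if abs(k - mean) >= c)

def bound(c):
    """Chebychev bound Var[X] / c^2."""
    return var / Fraction(c) ** 2

print(mean, var)
for c in (1, 2, Fraction(5, 2), 3):
    print(c, tail(c), bound(c), tail(c) <= bound(c))
```

The bound is far from sharp here, which is typical: it uses nothing about the distribution except its variance.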


Definition. For $X, Y \in \mathcal{L}^2$ define the covariance

$$ \mathrm{Cov}[X, Y] := E[(X - E[X])(Y - E[Y])] = E[XY] - E[X] E[Y] . $$

Two random variables in $\mathcal{L}^2$ are called uncorrelated if $\mathrm{Cov}[X, Y] = 0$.


Example. We have $\mathrm{Cov}[X, X] = \mathrm{Var}[X] = E[(X - E[X])^2]$ for a random
variable $X \in \mathcal{L}^2$.


Remark. The Cauchy-Schwarz inequality can be restated in the form

$$ |\mathrm{Cov}[X, Y]| \leq \sigma[X] \sigma[Y] . $$


Definition. The regression line of two random variables $X, Y$ is defined as
$y = ax + b$, where

$$ a = \frac{\mathrm{Cov}[X, Y]}{\mathrm{Var}[X]} , \quad b = E[Y] - a E[X] . $$

If $\Omega = \{1, \dots, n\}$ is a finite set, then the random variables $X, Y$ define the
vectors

$$ X = (X(1), \dots, X(n)) , \quad Y = (Y(1), \dots, Y(n)) $$

or $n$ data points $(X(i), Y(i))$ in the plane. As will follow from the
proposition below, the regression line has the property that it minimizes the sum
of the squares of the distances from these points to the line.




Figure. Regression line computed from a finite set of data
points $(X(i), Y(i))$.


Example. If $X, Y$ are independent, then $a = 0$. It follows that $b = E[Y]$.

Example. If $X = Y$, then $a = 1$ and $b = 0$. The best guess for $Y$ is $X$.


Proposition 2.5.6. If $y = ax + b$ is the regression line of $X, Y$, then the
random variable $\tilde{Y} = aX + b$ minimizes $\mathrm{Var}[Y - \tilde{Y}]$ under the constraint
$E[Y] = E[\tilde{Y}]$ and is the best guess for $Y$, when knowing only $E[Y]$ and
$\mathrm{Cov}[X, Y]$. We check $\mathrm{Cov}[X, Y] = \mathrm{Cov}[X, \tilde{Y}]$.


Proof. To minimize $\mathrm{Var}[aX + b - Y]$ under the constraint $E[aX + b - Y] = 0$ is
equivalent to finding $(a, b)$ which minimizes $f(a, b) = E[(aX + b - Y)^2]$ under
the constraint $g(a, b) = E[aX + b - Y] = 0$. This least square solution
can be obtained with the Lagrange multiplier method or by solving $b =
E[Y] - a E[X]$ and minimizing $h(a) = E[(aX - Y - E[aX - Y])^2] = a^2 (E[X^2] -
E[X]^2) - 2a (E[XY] - E[X] E[Y]) + \mathrm{Var}[Y] = a^2 \mathrm{Var}[X] - 2a \mathrm{Cov}[X, Y] + \mathrm{Var}[Y]$.
Setting $h'(a) = 0$ gives $a = \mathrm{Cov}[X, Y] / \mathrm{Var}[X]$. □
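For a finite set of data points with equal weights, the regression line can be computed directly from these formulas. The sketch below uses exact rational arithmetic and four assumed sample points; the function names are illustrative.

```python
from fractions import Fraction

def mean(v):
    """E[V] for equally likely points."""
    return sum(v) / Fraction(len(v))

def cov(x, y):
    """Cov[X, Y] = E[(X - E[X])(Y - E[Y])]."""
    mx, my = mean(x), mean(y)
    return mean([(a - mx) * (b - my) for a, b in zip(x, y)])

def regression_line(x, y):
    """Return (a, b) with a = Cov[X,Y]/Var[X] and b = E[Y] - a E[X]."""
    a = cov(x, y) / cov(x, x)
    b = mean(y) - a * mean(x)
    return a, b

def mean_square_error(x, y, a, b):
    """E[(aX + b - Y)^2], the quantity minimized in the proof."""
    return mean([(a * s + b - t) ** 2 for s, t in zip(x, y)])

x = [0, 1, 2, 3]
y = [1, 3, 2, 5]
a, b = regression_line(x, y)
print(a, b, mean_square_error(x, y, a, b))

# Any perturbation of (a, b) increases the mean square error.
eps = Fraction(1, 10)
print(all(mean_square_error(x, y, a + da, b + db) > mean_square_error(x, y, a, b)
          for da, db in [(eps, 0), (-eps, 0), (0, eps), (0, -eps), (eps, -eps)]))
```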


Definition. If the standard deviations $\sigma[X], \sigma[Y]$ are both different from
zero, then one can define the correlation coefficient

$$ \mathrm{Corr}[X, Y] = \frac{\mathrm{Cov}[X, Y]}{\sigma[X] \sigma[Y]} $$

which is a number in $[-1, 1]$. Two random variables in $\mathcal{L}^2$ are called
uncorrelated if $\mathrm{Corr}[X, Y] = 0$. The other extreme is $|\mathrm{Corr}[X, Y]| = 1$, then
$Y = aX + b$ by the Cauchy-Schwarz inequality.


Theorem 2.5.7 (Pythagoras). If two random variables $X, Y \in \mathcal{L}^2$ are
independent, then $\mathrm{Cov}[X, Y] = 0$. If $X$ and $Y$ are uncorrelated, then
$\mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y]$.


Proof. We can find monotone sequences of step functions

$$ X_n = \sum_{i=1}^n \alpha_i 1_{A_i} \to X , \quad Y_n = \sum_{j=1}^n \beta_j 1_{B_j} \to Y . $$

We can choose these functions in such a way that $A_i \in \mathcal{A} = \sigma(X)$ and
$B_j \in \mathcal{B} = \sigma(Y)$. By the Lebesgue dominated convergence theorem (2.4.3),
$E[X_n] \to E[X]$ and $E[Y_n] \to E[Y]$. Compute $X_n \cdot
Y_n = \sum_{i,j=1}^n \alpha_i \beta_j 1_{A_i} 1_{B_j}$. By the Lebesgue dominated convergence
theorem (2.4.3) again, $E[X_n Y_n] \to E[XY]$. By the independence of $X, Y$ we
have $E[X_n Y_n] = E[X_n] \cdot E[Y_n]$ and so $E[XY] = E[X] E[Y]$ which implies
$\mathrm{Cov}[X, Y] = E[XY] - E[X] \cdot E[Y] = 0$.

The second statement follows from

$$ \mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y] + 2 \, \mathrm{Cov}[X, Y] . $$

□


Remark. If $\Omega$ is a finite set, then the covariance $\mathrm{Cov}[X, Y]$ is the dot product between the centered random variables $X - E[X]$ and $Y - E[Y]$, and $\sigma[X]$ is the length of the vector $X - E[X]$ and the correlation coefficient $\mathrm{Corr}[X, Y]$ is the cosine of the angle $\alpha$ between $X - E[X]$ and $Y - E[Y]$ because the dot product satisfies $\vec{v} \cdot \vec{w} = |\vec{v}||\vec{w}|\cos(\alpha)$. So, uncorrelated random variables $X, Y$ have the property that $X - E[X]$ is perpendicular to $Y - E[Y]$. This geometric interpretation explains why theorem (2.5.7) is called Pythagoras theorem.


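For more inequalities in analysis, see the classic [29, 58]. We end this section with a list of properties of variance and covariance:


$\mathrm{Var}[X] = E[X^2] - E[X]^2$.
$\mathrm{Var}[\lambda X] = \lambda^2 \mathrm{Var}[X]$.
$\mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y] + 2\,\mathrm{Cov}[X, Y]$.
$\mathrm{Corr}[X, Y] \in [-1, 1]$.
$\mathrm{Cov}[X, Y] = E[XY] - E[X]E[Y]$.
$\mathrm{Cov}[X, Y] \leq \sigma[X]\sigma[Y]$.
$\mathrm{Corr}[X, Y] = 1$ if $X - E[X] = Y - E[Y]$.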


2.6 The weak law of large numbers 


Consider a sequence $X_1, X_2, \dots$ of random variables on a probability space $(\Omega, \mathcal{A}, P)$. We are interested in the asymptotic behavior of the sums $S_n = X_1 + X_2 + \cdots + X_n$ for $n \to \infty$ and especially in the convergence of the averages $S_n/n$. The limiting behavior is described by "laws of large numbers". Depending on the definition of convergence, one speaks of "weak" and "strong" laws of large numbers.

We first prove the weak law of large numbers. There exist different versions of this theorem since more assumptions on $X_n$ can allow stronger statements.


Definition. A sequence of random variables $Y_n$ converges in probability to a random variable $Y$, if for all $\epsilon > 0$,

$$ \lim_{n \to \infty} P[|Y_n - Y| \geq \epsilon] = 0. $$

One calls convergence in probability also stochastic convergence.


Remark. If for some $p \in [1, \infty)$, $\|X_n - X\|_p \to 0$, then $X_n \to X$ in probability since by the Chebychev-Markov inequality (2.5.4), $P[|X_n - X| \geq \epsilon] \leq \|X - X_n\|_p^p / \epsilon^p$.


Exercise. Show that if two random variables $X, Y \in \mathcal{L}^2$ have non-zero variance and satisfy $|\mathrm{Corr}(X, Y)| = 1$, then $Y = aX + b$ for some real numbers $a, b$.


Theorem 2.6.1 (Weak law of large numbers for uncorrelated random variables). Assume $X_i \in \mathcal{L}^2$ have common expectation $E[X_i] = m$ and satisfy $\sup_n \frac{1}{n} \sum_{i=1}^n \mathrm{Var}[X_i] < \infty$. If $X_n$ are pairwise uncorrelated, then $S_n/n \to m$ in probability.


Proof. Since $\mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y] + 2\,\mathrm{Cov}[X, Y]$ and the $X_n$ are pairwise uncorrelated, we get $\mathrm{Var}[X_n + X_m] = \mathrm{Var}[X_n] + \mathrm{Var}[X_m]$ and by induction $\mathrm{Var}[S_n] = \sum_{i=1}^n \mathrm{Var}[X_i]$. Using linearity, we obtain $E[S_n/n] = m$ and

$$ \mathrm{Var}\left[\frac{S_n}{n}\right] = E\left[\frac{S_n^2}{n^2}\right] - \frac{E[S_n]^2}{n^2} = \frac{\mathrm{Var}[S_n]}{n^2} = \frac{1}{n^2} \sum_{i=1}^n \mathrm{Var}[X_i]. $$

The right hand side converges to zero for $n \to \infty$. With Chebychev's inequality (2.5.5), we obtain

$$ P\left[\left|\frac{S_n}{n} - m\right| \geq \epsilon\right] \leq \frac{\mathrm{Var}[S_n/n]}{\epsilon^2}. $$
□


As an application in analysis, the weak law leads to a constructive proof of a theorem of Weierstrass which states that polynomials are dense in the space $C[0, 1]$ of all continuous functions on the interval $[0, 1]$. Unlike the abstract Weierstrass theorem, the proof with specific polynomials is constructive and gives explicit formulas.


Figure. Approximation of a function $f(x)$ by Bernstein polynomials $B_2, B_5, B_{10}, B_{20}, B_{30}$.


Theorem 2.6.2 (Weierstrass theorem). For every $f \in C[0, 1]$, the Bernstein polynomials

$$ B_n(x) = \sum_{k=0}^n f\left(\frac{k}{n}\right) \binom{n}{k} x^k (1 - x)^{n-k} $$

converge uniformly to $f$. If $f(x) \geq 0$, then also $B_n(x) \geq 0$.


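Proof. For $x \in [0, 1]$, let $X_n$ be a sequence of independent $\{0, 1\}$-valued random variables with mean value $x$. In other words, we take the probability space $(\{0, 1\}^{\mathbb{N}}, \mathcal{A}, P)$ defined by $P[\omega_n = 1] = x$. Since $P[S_n = k] = \binom{n}{k} x^k (1 - x)^{n-k}$, we can write $B_n(x) = E[f(\frac{S_n}{n})]$. We estimate

$$ \begin{aligned} |B_n(x) - f(x)| &= \left|E\left[f\left(\frac{S_n}{n}\right)\right] - f(x)\right| \leq E\left[\left|f\left(\frac{S_n}{n}\right) - f(x)\right|\right] \\ &\leq 2\|f\| \cdot P\left[\left|\frac{S_n}{n} - x\right| \geq \delta\right] + \sup_{|x - y| \leq \delta} |f(x) - f(y)| \cdot P\left[\left|\frac{S_n}{n} - x\right| < \delta\right] \\ &\leq 2\|f\| \cdot P\left[\left|\frac{S_n}{n} - x\right| \geq \delta\right] + \sup_{|x - y| \leq \delta} |f(x) - f(y)|. \end{aligned} $$

The second term in the last line is called the continuity module of $f$. It converges to zero for $\delta \to 0$. By the Chebychev inequality (2.5.5) and the proof of the weak law of large numbers, the first term can be estimated from above by

$$ 2\|f\| \frac{\mathrm{Var}[X_i]}{n \delta^2}, $$

a bound which goes to zero for $n \to \infty$ because the variance satisfies $\mathrm{Var}[X_i] = x(1 - x) \leq 1/4$. □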
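The Bernstein polynomials are explicit enough to be evaluated directly. The following Python sketch computes $B_n(x)$ from the defining sum and measures the largest error on a grid. The test function $f(x) = |x - 1/2|$, the grid and the helper names `bernstein` and `sup_error` are assumptions chosen for illustration only.

```python
from math import comb


def bernstein(f, n, x):
    """Evaluate B_n(x) = sum_k f(k/n) * C(n, k) * x^k * (1 - x)^(n - k)."""
    return sum(f(k / n) * comb(n, k) * x**k * (1 - x) ** (n - k) for k in range(n + 1))


def sup_error(f, n, grid_size=200):
    """Largest value of |B_n(x) - f(x)| over an equally spaced grid in [0, 1]."""
    grid = [i / grid_size for i in range(grid_size + 1)]
    return max(abs(bernstein(f, n, x) - f(x)) for x in grid)


def f(x):
    # A continuous but not differentiable test function (illustrative choice).
    return abs(x - 0.5)


for n in (2, 5, 10, 20, 30):
    print(n, sup_error(f, n))
```

The printed errors decrease slowly with $n$, in line with the Chebychev bound of order $1/(n\delta^2)$ used in the proof. Since $B_n(x) = E[f(S_n/n)]$, a linear function $f$ is reproduced exactly.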


In the first version of the weak law of large numbers theorem (2.6.1), we only assumed the random variables to be uncorrelated. Under the stronger condition of independence and a stronger condition on the moments ($X^4 \in \mathcal{L}^1$), the convergence can be accelerated:


Theorem 2.6.3 (Weak law of large numbers for independent $\mathcal{L}^4$ random variables). Assume $X_i \in \mathcal{L}^4$ have common expectation $E[X_i] = m$ and satisfy $M = \sup_n \|X_n\|_4^4 < \infty$. If $X_i$ are independent, then $S_n/n \to m$ in probability. Even $\sum_{n=1}^\infty P[|\frac{S_n}{n} - m| \geq \epsilon]$ converges for all $\epsilon > 0$.


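Proof. We can assume without loss of generality that $m = 0$. Because the $X_i$ are independent, we get

$$ E[S_n^4] = \sum_{i_1, i_2, i_3, i_4 = 1}^n E[X_{i_1} X_{i_2} X_{i_3} X_{i_4}]. $$

Again by independence, a summand $E[X_{i_1} X_{i_2} X_{i_3} X_{i_4}]$ is zero if an index $i = i_k$ occurs alone, is $E[X_i^4]$ if all indices are the same and $E[X_i^2] E[X_j^2]$, if there are two pairwise equal indices. There are $n$ summands of the first kind and $3n(n-1)$ summands of the second kind. Since by Jensen's inequality $E[X_i^2]^2 \leq E[X_i^4] \leq M$ we get

$$ E[S_n^4] \leq nM + 3n(n-1)M \leq 3n^2 M. $$

Use now the Chebychev-Markov inequality (2.5.4) with $h(x) = x^4$ to get

$$ P\left[\left|\frac{S_n}{n}\right| \geq \epsilon\right] \leq \frac{E[(S_n/n)^4]}{\epsilon^4} \leq \frac{3n^2 M}{\epsilon^4 n^4} = 3M \frac{1}{\epsilon^4 n^2}. $$
□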


We can weaken the moment assumption in order to deal with $\mathcal{L}^1$ random variables. Of course, the assumptions have to be made stronger at some other place.


Definition. A family $\{X_i\}_{i \in I}$ of random variables is called uniformly integrable, if $\sup_{i \in I} E[1_{|X_i| \geq R} |X_i|] \to 0$ for $R \to \infty$. A convenient notation which we will quite often use in the future is $E[1_A X] = E[X; A]$ for $X \in \mathcal{L}^1$ and $A \in \mathcal{A}$.


Theorem 2.6.4 (Weak law for uniformly integrable, independent $\mathcal{L}^1$ random variables). Assume $X_i \in \mathcal{L}^1$ are uniformly integrable. If $X_i$ are independent, then $\frac{1}{n} \sum_{m=1}^n (X_m - E[X_m]) \to 0$ in $\mathcal{L}^1$ and therefore in probability.


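Proof. Without loss of generality, we can assume that $E[X_n] = 0$ for all $n \in \mathbb{N}$, because otherwise $X_n$ can be replaced by $Y_n = X_n - E[X_n]$. Define $f_R(t) = t 1_{[-R, R]}(t)$, the random variables

$$ X_n^{(R)} = f_R(X_n) - E[f_R(X_n)], \qquad Y_n^{(R)} = X_n - X_n^{(R)} $$

as well as the random variables

$$ S_n^{(R)} = \frac{1}{n} \sum_{i=1}^n X_i^{(R)}, \qquad T_n^{(R)} = \frac{1}{n} \sum_{i=1}^n Y_i^{(R)}. $$

Since $S_n/n = S_n^{(R)} + T_n^{(R)}$, we estimate, using the Minkowski and Cauchy-Schwarz inequalities

$$ \begin{aligned} \|S_n/n\|_1 &\leq \|S_n^{(R)}\|_1 + \|T_n^{(R)}\|_1 \\ &\leq \|S_n^{(R)}\|_2 + 2 \sup_{1 \leq l \leq n} E[|X_l|; |X_l| \geq R] \\ &\leq \frac{R}{\sqrt{n}} + 2 \sup_{l \in \mathbb{N}} E[|X_l|; |X_l| \geq R]. \end{aligned} $$

In the last step we have used the independence of the random variables and $E[X_n^{(R)}] = 0$ to get

$$ \|S_n^{(R)}\|_2^2 = E[(S_n^{(R)})^2] = \frac{1}{n^2} \sum_{i=1}^n E[(X_i^{(R)})^2] \leq \frac{R^2}{n}. $$

The claim follows from the uniform integrability assumption $\sup_{l \in \mathbb{N}} E[|X_l|; |X_l| \geq R] \to 0$ for $R \to \infty$. □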


A special case of the weak law of large numbers is the situation where all the random variables are IID:


Theorem 2.6.5 (Weak law of large numbers for IID $\mathcal{L}^1$ random variables). Assume $X_i \in \mathcal{L}^1$ are IID random variables with mean $m$. Then $S_n/n \to m$ in $\mathcal{L}^1$ and so in probability.


Proof. We show that a set of IID $\mathcal{L}^1$ random variables is uniformly integrable: given $X \in \mathcal{L}^1$, we have $K \cdot P[|X| > K] \leq \|X\|_1$ so that $P[|X| > K] \to 0$ for $K \to \infty$, and therefore $E[|X|; |X| \geq R] \to 0$ for $R \to \infty$ by the dominated convergence theorem.

Because the random variables $X_i$ are identically distributed, $E[|X_i|; |X_i| \geq R]$ is independent of $i$. Consequently any set of IID random variables in $\mathcal{L}^1$ is also uniformly integrable. We can now use theorem (2.6.4). □
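The weak law can be watched at work in a small simulation. The following Python sketch estimates $P[|S_n/n - m| \geq \epsilon]$ for fair coin flips with values in $\{0, 1\}$, so that $m = 1/2$ and $\mathrm{Var}[X_i] = 1/4$, and prints the Chebychev bound $\mathrm{Var}[X_i]/(n\epsilon^2)$ next to it. The function name, the seed and the sample sizes are illustrative assumptions.

```python
import random


def fraction_far_from_mean(n, eps, trials, rng):
    """Monte Carlo estimate of P[|S_n/n - 1/2| >= eps] for fair {0, 1} coin flips."""
    far = 0
    for _ in range(trials):
        s = sum(1 for _ in range(n) if rng.random() < 0.5)
        if abs(s / n - 0.5) >= eps:
            far += 1
    return far / trials


rng = random.Random(0)  # fixed seed so that the run is reproducible
eps = 0.1
for n in (10, 100, 1000):
    estimate = fraction_far_from_mean(n, eps, 1000, rng)
    chebychev = min(1.0, 0.25 / (n * eps**2))  # Var[X_i] / (n eps^2), capped at 1
    print(n, estimate, chebychev)
```

The estimated probabilities drop much faster than the Chebychev bound, which only uses second moments.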


Example. The random variable $X(x) = x^2$ on $[0, 1]$ has the expectation $m = E[X] = \int_0^1 x^2 \, dx = 1/3$. For every $n$, we can form the average $S_n/n = (x_1^2 + x_2^2 + \cdots + x_n^2)/n$. The weak law of large numbers tells us that $P[|S_n/n - 1/3| \geq \epsilon] \to 0$ for $n \to \infty$. Geometrically, this means that for every $\epsilon > 0$, the volume of the set of points in the $n$-dimensional unit cube for which the distance $r(x_1, \dots, x_n) = \sqrt{x_1^2 + \cdots + x_n^2}$ to the origin satisfies $\sqrt{n(1/3 - \epsilon)} \leq r \leq \sqrt{n(1/3 + \epsilon)}$ converges to 1 for $n \to \infty$. In colloquial language, one could rephrase this by saying that asymptotically, as the number of dimensions goes to infinity, most of the weight of an $n$-dimensional cube is concentrated near a shell of radius $1/\sqrt{3} \approx 0.58$ times the length $\sqrt{n}$ of the longest diagonal in the cube.
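The concentration of the cube near a shell can be observed by sampling. The following Python sketch draws uniformly distributed points in $[0, 1]^n$ and counts how often the mean of the squared coordinates lies within $\epsilon$ of $\int_0^1 x^2 \, dx = 1/3$. The function name, the seed and the parameters are assumptions made for this illustration.

```python
import random


def fraction_in_shell(n, eps, samples, rng):
    """Fraction of random points of [0, 1]^n with |(x_1^2 + ... + x_n^2)/n - 1/3| < eps."""
    inside = 0
    for _ in range(samples):
        mean_square = sum(rng.random() ** 2 for _ in range(n)) / n
        if abs(mean_square - 1 / 3) < eps:
            inside += 1
    return inside / samples


rng = random.Random(0)  # fixed seed so that the run is reproducible
for n in (10, 100, 1000):
    print(n, fraction_in_shell(n, 0.05, 500, rng))
```

The printed fractions approach 1 as the dimension grows, which is the statement that the radius $r$ is close to $\sqrt{n/3}$ for most points of the cube.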


Exercise. Show that if $X, Y \in \mathcal{L}^1$ are independent random variables, then $XY \in \mathcal{L}^1$. Find an example of two random variables $X, Y \in \mathcal{L}^1$ for which $XY \notin \mathcal{L}^1$.


Exercise. a) Given a sequence $p_n \in [0, 1]$ and a sequence $X_n$ of independent random variables taking values in $\{-1, 1\}$ such that $P[X_n = 1] = p_n$ and $P[X_n = -1] = 1 - p_n$. Show that

$$ \frac{1}{n} \sum_{k=1}^n (X_k - m_k) \to 0 $$

in probability, where $m_k = 2p_k - 1$.

b) We assume the same setup as in a) but this time, the sequence $p_n$ is dependent on a parameter. Given a sequence $X_n$ of independent random variables taking values in $\{-1, 1\}$ such that $P[X_n = 1] = p_n$ and $P[X_n = -1] = 1 - p_n$ with $p_n = (1 + \cos[\theta + n\alpha])/2$, where $\theta$ is a parameter. Prove that $\frac{1}{n} \sum_{k=1}^n X_k \to 0$ in $\mathcal{L}^1$ for almost all $\theta$. You can take for granted the fact that $\frac{1}{n} \sum_{k=1}^n p_k \to 1/2$ for almost all real parameters $\theta \in [0, 2\pi]$.


Exercise. Prove that if $X_n \to X$ in $\mathcal{L}^1$, then there exists a subsequence $Y_k = X_{n_k}$ satisfying $Y_k \to X$ almost everywhere.


Exercise. Given a sequence of random variables $X_n$. Show that $X_n$ converges to $X$ in probability if and only if

$$ E\left[\frac{|X_n - X|}{1 + |X_n - X|}\right] \to 0 $$

for $n \to \infty$.


Exercise. Give an example of a sequence of random variables $X_n$ which converges almost everywhere, but not completely.


Exercise. Use the weak law of large numbers to verify that the volume of an $n$-dimensional ball of radius 1 satisfies $V_n \to 0$ for $n \to \infty$. Estimate how fast the volume goes to 0. (See example (2.6))


2.7 The probability distribution function 


Definition. The law of a random variable $X$ is the probability measure $\mu$ on $\mathbb{R}$ defined by $\mu(B) = P[X^{-1}(B)]$ for all $B$ in the Borel $\sigma$-algebra of $\mathbb{R}$. The measure $\mu$ is also called the push-forward measure under the measurable map $X: \Omega \to \mathbb{R}$.


Definition. The distribution function of a random variable $X$ is defined as

$$ F_X(s) = \mu((-\infty, s]) = P[X \leq s]. $$


The distribution function is sometimes also called cumulative density function (CDF) but we do not use this name here in order not to confuse it with the probability density function (PDF) $f_X(s) = F_X'(s)$ for continuous random variables.


Remark. The distribution function $F$ is very useful. For example, if $X$ is a continuous random variable with distribution function $F$, then $Y = F(X)$ has the uniform distribution on $[0, 1]$. We can reverse this. If we want to produce random variables with a distribution function $F$, just take a random variable $Y$ with uniform distribution on $[0, 1]$ and define $X = F^{-1}(Y)$. This random variable has the distribution function $F$ because the event $\{X \in [a, b]\} = \{F^{-1}(Y) \in [a, b]\} = \{Y \in F([a, b])\} = \{Y \in [F(a), F(b)]\}$ has probability $F(b) - F(a)$. We see that we need only to have a random number generator which produces uniformly distributed random variables in $[0, 1]$ to produce random variables with a given continuous distribution.


Definition. A set of random variables is called identically distributed, if each random variable in the set has the same distribution function. It is called independent and identically distributed if the random variables are independent and identically distributed. A common abbreviation for independent identically distributed random variables is IID.
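The recipe $X = F^{-1}(Y)$ from the remark on distribution functions is easy to try out. As an assumed example which is not taken from these notes, the following Python sketch uses $F(s) = 1 - e^{-s}$ on $[0, \infty)$, the distribution function of an exponential random variable with mean 1, for which $F^{-1}(y) = -\log(1 - y)$. All function names are hypothetical.

```python
import math
import random


def inverse_cdf_exponential(y):
    """F^{-1}(y) for F(s) = 1 - exp(-s); y is assumed to lie in [0, 1)."""
    return -math.log(1.0 - y)


def sample_exponential(count, rng):
    """Produce samples X = F^{-1}(Y) from uniformly distributed Y in [0, 1)."""
    return [inverse_cdf_exponential(rng.random()) for _ in range(count)]


def empirical_cdf(samples, s):
    """Fraction of samples which are at most s."""
    return sum(1 for x in samples if x <= s) / len(samples)


samples = sample_exponential(20000, random.Random(0))
print(sum(samples) / len(samples))                      # sample mean, close to 1
print(empirical_cdf(samples, 1.0), 1 - math.exp(-1.0))  # empirical versus exact F(1)
```

The empirical distribution function of the generated numbers agrees with $F$ up to sampling error, which is what the computation in the remark predicts.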


Example. Let $\Omega = [0, 1]$ be the unit interval with the Lebesgue measure $\mu$ and let $m$ be a positive integer. Define the random variable $X(x) = x^m$. One calls its distribution a power distribution. It is in $\mathcal{L}^1$ and has the expectation $E[X] = 1/(m + 1)$. The distribution function of $X$ is $F_X(s) = s^{1/m}$ on $[0, 1]$ and $F_X(s) = 0$ for $s < 0$ and $F_X(s) = 1$ for $s \geq 1$. The random variable is continuous in the sense that it has a probability density function $f_X(s) = F_X'(s) = s^{1/m - 1}/m$ so that $F_X(s) = \int_{-\infty}^s f_X(t) \, dt$.


Figure. The distribution function $F_X(s)$ of $X(x) = x^m$ in the case $m = 2$.

Figure. The density function $f_X(s)$ of $X(x) = x^m$ in the case $m = 2$.


Given two IID random variables $X, Y$ with the $m$'th power distribution as above, we can look at the random variables $V = X + Y$, $W = X - Y$. One can realize $V$ and $W$ on the unit square $\Omega = [0, 1] \times [0, 1]$ by $V(x, y) = x^m + y^m$ and $W(x, y) = x^m - y^m$. The distribution functions $F_V(s) = P[V \leq s]$ and $F_W(s) = P[W \leq s]$ are the areas of the sets $A(s) = \{(x, y) \mid x^m + y^m \leq s\}$ and $B(s) = \{(x, y) \mid x^m - y^m \leq s\}$.


Figure. $F_V(s)$ is the area of the set $A(s)$, shown here in the case $m = 4$.

Figure. $F_W(s)$ is the area of the set $B(s)$, shown here in the case $m = 4$.


We will later see how to compute the distribution function of a sum of independent random variables algebraically from the probability distribution function $F_X$. From the area interpretation, we see in this case

$$ F_V(s) = \begin{cases} \int_0^{s^{1/m}} (s - x^m)^{1/m} \, dx, & s \in [0, 1], \\ 1 - \int_{(s-1)^{1/m}}^1 \left(1 - (s - x^m)^{1/m}\right) dx, & s \in [1, 2] \end{cases} $$

and

$$ F_W(s) = \begin{cases} \int_0^{(s+1)^{1/m}} \left(1 - (x^m - s)^{1/m}\right) dx, & s \in [-1, 0], \\ s^{1/m} + \int_{s^{1/m}}^1 \left(1 - (x^m - s)^{1/m}\right) dx, & s \in [0, 1]. \end{cases} $$


Figure. The function $F_V(s)$ with density (dashed) $f_V(s)$ of the sum of two power distributed random variables with $m = 2$.

Figure. The function $F_W(s)$ with density (dashed) $f_W(s)$ of the difference of two power distributed random variables with $m = 2$.


Exercise. a) Verify that for $\theta > 0$ the Maxwell distribution

$$ f(x) = \frac{4}{\sqrt{\pi}} \theta^{3/2} x^2 e^{-\theta x^2} $$

is a probability distribution on $\mathbb{R}^+ = [0, \infty)$. This distribution can model the speed distribution of molecules in thermal equilibrium.

b) Verify that for $\theta > 0$ the Rayleigh distribution

$$ f(x) = 2\theta x e^{-\theta x^2} $$

is a probability distribution on $\mathbb{R}^+ = [0, \infty)$. This distribution can model the speed distribution $\sqrt{X^2 + Y^2}$ of a two dimensional wind velocity $(X, Y)$, where both $X, Y$ are normal random variables.


2.8 Convergence of random variables 


In order to formulate the strong law of large numbers, we need some other 
notions of convergence. 


Definition. A sequence of random variables $X_n$ converges in probability to a random variable $X$, if

$$ P[|X_n - X| \geq \epsilon] \to 0 $$

for all $\epsilon > 0$.


Definition. A sequence of random variables $X_n$ converges almost everywhere or almost surely to a random variable $X$, if $P[X_n \to X] = 1$.


Definition. A sequence of $\mathcal{L}^p$ random variables $X_n$ converges in $\mathcal{L}^p$ to a random variable $X$, if

$$ \|X_n - X\|_p \to 0 $$

for $n \to \infty$.


Definition. A sequence of random variables $X_n$ converges fast in probability, or completely if

$$ \sum_n P[|X_n - X| \geq \epsilon] < \infty $$

for all $\epsilon > 0$.


We have so four notions of convergence of random variables $X_n \to X$, if the random variables are defined on the same probability space $(\Omega, \mathcal{A}, P)$. Later we will see the two equivalent but weaker notions convergence in distribution and weak convergence, which do not necessarily assume $X_n$ and $X$ to be defined on the same probability space. Let us add these two definitions also here. We will see later, in theorem (2.13.2) that the following definitions are equivalent:


Definition. A sequence of random variables $X_n$ converges in distribution, if $F_{X_n}(s) \to F_X(s)$ for all points $s$, where $F_X$ is continuous.


Example. Let $\Omega_n = \{1, 2, \dots, n\}$ with the uniform distribution $P[\{k\}] = 1/n$ and $X_n$ the random variable $X_n(x) = x/n$. Let $X(x) = x$ on the probability space $[0, 1]$ with probability $P[[a, b)] = b - a$. The random variables $X_n$ and $X$ are defined on different probability spaces but $X_n$ converges to $X$ in distribution for $n \to \infty$.
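The convergence in this example can be made quantitative. The following Python sketch computes the distribution function of $X_n$ exactly with rational arithmetic and compares it with $F_X(s) = s$ on a grid in $[0, 1]$. The helper names and the grid size are assumptions made for illustration.

```python
from fractions import Fraction


def cdf_xn(n, s):
    """F_{X_n}(s) = P[X_n <= s], where X_n takes the values 1/n, 2/n, ..., 1 with probability 1/n each."""
    hits = sum(1 for k in range(1, n + 1) if Fraction(k, n) <= s)
    return Fraction(hits, n)


def cdf_x(s):
    """F_X(s) for the uniform distribution on [0, 1]."""
    return min(max(s, Fraction(0)), Fraction(1))


def max_gap(n, grid_size=200):
    """Largest value of |F_{X_n}(s) - F_X(s)| over the grid s = j / grid_size."""
    grid = [Fraction(j, grid_size) for j in range(grid_size + 1)]
    return max(abs(cdf_xn(n, s) - cdf_x(s)) for s in grid)


for n in (5, 50, 500):
    print(n, float(max_gap(n)))
```

Because $F_{X_n}$ is a step function with jumps of size $1/n$ which touches the diagonal at the points $k/n$, the gap is never larger than $1/n$.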


Definition. A sequence of random variables $X_n$ converges in law to a random variable $X$, if the laws $\mu_n$ of $X_n$ converge weakly to the law $\mu$ of $X$.


Remark. In other words, $X_n$ converges weakly to $X$ if for every continuous function $f$ on $\mathbb{R}$ of compact support, one has

$$ \int f(x) \, d\mu_n(x) \to \int f(x) \, d\mu(x). $$


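Proposition 2.8.1. The next figure shows the relations between the different convergence types.


0) In distribution = in law: $F_{X_n}(s) \to F_X(s)$, $F_X$ continuous at $s$

1) In probability: $P[|X_n - X| \geq \epsilon] \to 0$, $\forall \epsilon > 0$

2) Almost everywhere: $P[X_n \to X] = 1$

3) In $\mathcal{L}^p$: $\|X_n - X\|_p \to 0$

4) Complete: $\sum_n P[|X_n - X| \geq \epsilon] < \infty$, $\forall \epsilon > 0$

Implications: 4) $\Rightarrow$ 2) $\Rightarrow$ 1) $\Rightarrow$ 0) and 3) $\Rightarrow$ 1).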


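Proof. 2) $\Rightarrow$ 1): Since

$$ \{X_n \to X\} = \bigcap_k \bigcup_m \bigcap_{n \geq m} \left\{|X_n - X| \leq \frac{1}{k}\right\}, $$

"almost everywhere convergence" is equivalent to

$$ 1 = P\left[\bigcup_m \bigcap_{n \geq m} \left\{|X_n - X| \leq \frac{1}{k}\right\}\right] = \lim_{m \to \infty} P\left[\bigcap_{n \geq m} \left\{|X_n - X| \leq \frac{1}{k}\right\}\right] $$

for all $k$. Therefore

$$ P[|X_m - X| \geq \epsilon] \leq P\left[\bigcup_{n \geq m} \{|X_n - X| \geq \epsilon\}\right] \to 0 $$

for all $\epsilon > 0$.

4) $\Rightarrow$ 2): The first Borel-Cantelli lemma implies that for all $\epsilon > 0$

$$ P[|X_n - X| \geq \epsilon, \text{ infinitely often}] = 0. $$

We get so for $\epsilon_k \to 0$

$$ P\left[\bigcup_k \{|X_n - X| \geq \epsilon_k, \text{ infinitely often}\}\right] \leq \sum_k P[|X_n - X| \geq \epsilon_k, \text{ infinitely often}] = 0 $$

from which we obtain $P[X_n \to X] = 1$.

3) $\Rightarrow$ 1): Use the Chebychev-Markov inequality (2.5.4), to get

$$ P[|X_n - X| \geq \epsilon] \leq \frac{E[|X_n - X|^p]}{\epsilon^p}. $$
□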


Example. Here is an example of convergence in probability but not almost everywhere convergence. Let $([0, 1], \mathcal{A}, P)$ be the Lebesgue measure space, where $\mathcal{A}$ is the Borel $\sigma$-algebra on $[0, 1]$. Define the random variables

$$ X_{n,k} = 1_{[k 2^{-n}, (k+1) 2^{-n}]}, \quad n = 1, 2, \dots, \quad k = 0, \dots, 2^n - 1. $$

By lexicographical ordering $X_1 = X_{1,0}, X_2 = X_{1,1}, X_3 = X_{2,0}, X_4 = X_{2,1}, \dots$ we get a sequence $X_n$ satisfying

$$ \liminf_{n \to \infty} X_n(\omega) = 0, \qquad \limsup_{n \to \infty} X_n(\omega) = 1 $$

but $P[|X_{n,k}| \geq \epsilon] \leq 2^{-n}$.
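The behavior of this sequence at a fixed point $\omega$ can be listed explicitly. The following Python sketch evaluates the indicator functions of the dyadic intervals $[k 2^{-n}, (k+1) 2^{-n}]$ in lexicographic order at a chosen $\omega$. The function name and the choice $\omega = 0.3$ are illustrative assumptions.

```python
def typewriter_values(omega, levels):
    """Values at omega of the indicators of [k 2^-n, (k+1) 2^-n], n = 1..levels, k = 0..2^n - 1."""
    values = []
    for n in range(1, levels + 1):
        width = 2.0 ** (-n)  # interval length, which is also the probability of the event
        for k in range(2**n):
            values.append(1 if k * width <= omega <= (k + 1) * width else 0)
    return values


values = typewriter_values(0.3, 8)
print(len(values), sum(values))  # number of terms and number of ones among them
print(values[:14])               # the levels n = 1, 2, 3
```

On every level $n$ the value 1 occurs for exactly one $k$ if $\omega$ is not a dyadic rational, so the sequence keeps returning to 1 although the intervals carrying the value 1 have length $2^{-n}$, which tends to zero.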


Example. And here is an example of almost everywhere but not $\mathcal{L}^p$ convergence: the random variables

$$ X_n = 2^n 1_{[0, 2^{-n}]} $$

on the probability space $([0, 1], \mathcal{A}, P)$ converge almost everywhere to the constant random variable $X = 0$ but not in $\mathcal{L}^p$ because $\|X_n\|_p = 2^{n(p-1)/p}$.


With more assumptions other implications can hold. We give two examples. 


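Proposition 2.8.2. Given a sequence $X_n \in \mathcal{L}^\infty$ with $\|X_n\|_\infty \leq K$ for all $n$, then $X_n \to X$ in probability if and only if $X_n \to X$ in $\mathcal{L}^1$.


Proof. (i) $P[|X| \leq K] = 1$. Proof. For $k \in \mathbb{N}$,

$$ P\left[|X| > K + \frac{1}{k}\right] \leq P\left[|X - X_n| > \frac{1}{k}\right] \to 0, \quad n \to \infty $$

so that $P[|X| > K + \frac{1}{k}] = 0$. Therefore

$$ P[|X| > K] = P\left[\bigcup_k \left\{|X| > K + \frac{1}{k}\right\}\right] = 0. $$

(ii) Given $\epsilon > 0$. Choose $m$ such that for all $n > m$

$$ P\left[|X_n - X| > \frac{\epsilon}{3}\right] < \frac{\epsilon}{3K}. $$

Then, using (i)

$$ \begin{aligned} E[|X_n - X|] &= E\left[|X_n - X|; |X_n - X| > \frac{\epsilon}{3}\right] + E\left[|X_n - X|; |X_n - X| \leq \frac{\epsilon}{3}\right] \\ &\leq 2K P\left[|X_n - X| > \frac{\epsilon}{3}\right] + \frac{\epsilon}{3} \leq \epsilon. \end{aligned} $$
□


Definition. A family $\mathcal{C} \subset \mathcal{L}^1$ of random variables is called uniformly integrable, if

$$ \lim_{R \to \infty} \sup_{X \in \mathcal{C}} E[1_{|X| > R} |X|] = 0. $$

The next lemma was already used in the proof of the weak law of large numbers for IID random variables.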


Lemma 2.8.3. Given $X \in \mathcal{L}^1$ and $\epsilon > 0$. Then, there exists $K \geq 0$ with $E[|X|; |X| > K] < \epsilon$.


Proof. Given $\epsilon > 0$. If $X \in \mathcal{L}^1$, we can find $\delta > 0$ such that if $P[A] < \delta$, then $E[|X|; A] < \epsilon$. Since $K P[|X| > K] \leq E[|X|]$, we can choose $K$ such that $P[|X| > K] < \delta$. Therefore $E[|X|; |X| > K] < \epsilon$. □


The next proposition gives a necessary and sufficient condition for $\mathcal{L}^1$ convergence.


Proposition 2.8.4. Given a sequence of random variables $X_n \in \mathcal{L}^1$. The following is equivalent:

a) $X_n$ converges in probability to $X$ and $\{X_n\}_{n \in \mathbb{N}}$ is uniformly integrable.
b) $X_n$ converges in $\mathcal{L}^1$ to $X$.


Proof. a) $\Rightarrow$ b). Define for $K \geq 0$ and a random variable $X$ the bounded variable

$$ X^{(K)} = X \cdot 1_{\{-K \leq X \leq K\}} + K \cdot 1_{\{X > K\}} - K \cdot 1_{\{X < -K\}}. $$

By the uniform integrability condition and the above lemma (2.8.3), we can choose $K$ such that for all $n$,

$$ E[|X_n^{(K)} - X_n|] < \frac{\epsilon}{3}, \qquad E[|X^{(K)} - X|] < \frac{\epsilon}{3}. $$

Since $|X_n^{(K)} - X^{(K)}| \leq |X_n - X|$, we have $X_n^{(K)} \to X^{(K)}$ in probability. We have so by the last proposition (2.8.2) $X_n^{(K)} \to X^{(K)}$ in $\mathcal{L}^1$ so that for $n > m$, $E[|X_n^{(K)} - X^{(K)}|] \leq \epsilon/3$. Therefore, for $n > m$ also

$$ E[|X_n - X|] \leq E[|X_n - X_n^{(K)}|] + E[|X_n^{(K)} - X^{(K)}|] + E[|X^{(K)} - X|] \leq \epsilon. $$

b) $\Rightarrow$ a). We have seen already that $X_n \to X$ in probability if $\|X_n - X\|_1 \to 0$. We have to show that $X_n \to X$ in $\mathcal{L}^1$ implies that $X_n$ is uniformly integrable.
Given $\epsilon > 0$. There exists $m$ such that $E[|X_n - X|] < \epsilon/2$ for $n > m$. By the absolute continuity property, we can choose $\delta > 0$ such that $P[A] < \delta$ implies

$$ E[|X_n|; A] < \epsilon, \quad 1 \leq n \leq m, \qquad E[|X|; A] < \epsilon/2. $$

Because $X_n$ is bounded in $\mathcal{L}^1$, we can choose $K$ such that $K^{-1} \sup_n E[|X_n|] < \delta$ which implies $P[|X_n| > K] < \delta$. For $n > m$, we have therefore

$$ E[|X_n|; |X_n| > K] \leq E[|X|; |X_n| > K] + E[|X - X_n|] \leq \epsilon. $$
□


Exercise. a) $P[\sup_{k \geq n} |X_k - X| > \epsilon] \to 0$ for $n \to \infty$ and all $\epsilon > 0$ if and only if $X_n \to X$ almost everywhere.
b) A sequence $X_n$ converges almost surely if and only if

$$ \lim_{n \to \infty} P[\sup_{k \geq 1} |X_{n+k} - X_n| > \epsilon] = 0 $$

for all $\epsilon > 0$.


2.9 The strong law of large numbers 


The weak law of large numbers makes a statement about the stochastic convergence of sums

$$ \frac{S_n}{n} = \frac{X_1 + \cdots + X_n}{n} $$

of random variables $X_n$. The strong laws of large numbers make analogous statements about almost everywhere convergence.


The first version of the strong law does not assume the random variables to have the same distribution. They are assumed to have the same expectation and have to be bounded in $\mathcal{L}^4$.


Theorem 2.9.1 (Strong law for independent $\mathcal{L}^4$-random variables). Assume $X_n$ are independent random variables in $\mathcal{L}^4$ with common expectation $E[X_n] = m$ and for which $M = \sup_n \|X_n\|_4^4 < \infty$. Then $S_n/n \to m$ almost everywhere.


Proof. In the proof of theorem (2.6.3), we derived a bound of the form

$$ P\left[\left|\frac{S_n}{n} - m\right| \geq \epsilon\right] \leq C M \frac{1}{\epsilon^4 n^2} $$

with a constant $C$ which does not depend on $n$. Since $\sum_n n^{-2} < \infty$, this means that $S_n/n \to m$ converges completely. By proposition (2.8.1) we have almost everywhere convergence. □


Here is an application of the strong law: 


Definition. A real number $x \in [0, 1]$ is called normal to the base 10, if its decimal expansion $x = 0.x_1 x_2 \dots$ has the property that each digit appears with the same frequency $1/10$.


Corollary 2.9.2. (Normality of numbers) On the probability space $([0, 1], \mathcal{B}, Q = dx)$, Lebesgue almost all numbers $x$ are normal.


Proof. Define the random variables $X_n(x) = x_n$, where $x_n$ is the $n$'th decimal digit. We have only to verify that $X_n$ are IID random variables. The strong law of large numbers will assure that almost all $x$ are normal. Let $\Omega = \{0, 1, \dots, 9\}^{\mathbb{N}}$ be the space of all infinite sequences $\omega = (\omega_1, \omega_2, \omega_3, \dots)$. Define on $\Omega$ the product $\sigma$-algebra $\mathcal{A}$ and the product probability measure $P$. Define the measurable map $S(\omega) = \sum_{k=1}^\infty \omega_k/10^k = x$ from $\Omega$ to $[0, 1]$. It produces for every sequence in $\Omega$ a real number $x \in [0, 1]$. The integers $\omega_k$ are just the decimal digits of $x$. The map $S$ is measure preserving and can be inverted on a set of measure 1 because almost all real numbers have a unique decimal expansion.

Because $X_n(x) = X_n(S(\omega)) = Y_n(\omega) = \omega_n$, if $S(\omega) = x$, we see that $X_n$ are the same random variables as $Y_n$. The latter are by construction IID with uniform distribution on $\{0, 1, \dots, 9\}$. □


Remark. While almost all numbers are normal, it is difficult to decide normality for specific real numbers. One does not know for example whether $\pi - 3 = 0.1415926\dots$ or $\sqrt{2} - 1 = 0.41421\dots$ is normal.
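Numbers constructed for the purpose are a different matter. Champernowne's number $0.123456789101112\dots$, obtained by concatenating the positive integers, is known to be normal to the base 10. It is not discussed in these notes and serves here only as an assumed illustration. The following Python sketch counts the digit frequencies in a finite initial piece of its expansion; the function name is hypothetical.

```python
from collections import Counter


def champernowne_digit_frequencies(limit):
    """Digit frequencies in the concatenation of the decimal expansions of 1, 2, ..., limit."""
    counts = Counter()
    for number in range(1, limit + 1):
        counts.update(str(number))
    total = sum(counts.values())
    return {digit: counts[digit] / total for digit in "0123456789"}


frequencies = champernowne_digit_frequencies(99999)
for digit, frequency in frequencies.items():
    print(digit, round(frequency, 4))
```

The frequencies are close to $1/10$ but the digit 0 is still visibly underrepresented at this length, because integers are written without leading zeros. This shows that the convergence of digit frequencies can be slow even for a number which is known to be normal.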

The strong law for IID random variables was first proven by Kolmogorov in 1930. Only much later, in 1981, it was observed that the weaker notion of pairwise independence is sufficient [25].


Theorem 2.9.3 (Strong law for pairwise independent $\mathcal{L}^1$ random variables). Assume $X_n \in \mathcal{L}^1$ are pairwise independent and identically distributed random variables. Then $S_n/n \to E[X_1]$ almost everywhere.


Proof. We can assume without loss of generality that $X_n \geq 0$ (because we can split $X_n = X_n^+ - X_n^-$ into its positive part $X_n^+ = X_n \vee 0 = \max(X_n, 0)$ and negative part $X_n^- = (-X_n) \vee 0 = \max(-X_n, 0)$. Knowing the result for $X_n^{\pm}$ implies the result for $X_n$).

Define $f_R(t) = t \cdot 1_{[-R, R]}(t)$, the random variables $X_n^{(R)} = f_R(X_n)$ and $Y_n = X_n^{(n)}$ as well as the averages (in this proof, $S_n$ denotes the average and not the sum)

$$ S_n = \frac{1}{n} \sum_{i=1}^n X_i, \qquad T_n = \frac{1}{n} \sum_{i=1}^n Y_i. $$

(i) It is enough to show that $T_n - E[T_n] \to 0$.
Proof. Since $E[Y_n] \to E[X_1] = m$, we get $E[T_n] \to m$. Because

$$ \begin{aligned} \sum_{n \geq 1} P[Y_n \neq X_n] &\leq \sum_{n \geq 1} P[X_n \geq n] = \sum_{n \geq 1} P[X_1 \geq n] \\ &= \sum_{n \geq 1} \sum_{k \geq n} P[X_1 \in [k, k+1)] \\ &= \sum_{k \geq 1} k \cdot P[X_1 \in [k, k+1)] \leq E[X_1] < \infty, \end{aligned} $$

we get by the first Borel-Cantelli lemma that $P[Y_n \neq X_n, \text{ infinitely often}] = 0$. This means $T_n - S_n \to 0$ almost everywhere, so that $T_n - E[T_n] \to 0$ almost everywhere implies $S_n \to m$ almost everywhere.

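(ii) Fix a real number $\alpha > 1$ and define an exponentially growing subsequence $k_n = [\alpha^n]$ which is the integer part of $\alpha^n$. Denote by $\mu$ the law of the random variables $X_n$. For every $\epsilon > 0$, we get using Chebychev inequality (2.5.5), pairwise independence for $k_n = [\alpha^n]$ and constants $C$ which can vary from line to line:

$$ \begin{aligned} \sum_{n=1}^\infty P[|T_{k_n} - E[T_{k_n}]| \geq \epsilon] &\leq \sum_{n=1}^\infty \frac{\mathrm{Var}[T_{k_n}]}{\epsilon^2} \\ &= \sum_{n=1}^\infty \frac{1}{\epsilon^2 k_n^2} \sum_{m=1}^{k_n} \mathrm{Var}[Y_m] \\ &= \frac{1}{\epsilon^2} \sum_{m=1}^\infty \mathrm{Var}[Y_m] \sum_{n: k_n \geq m} \frac{1}{k_n^2} \\ &\leq_{(1)} \frac{1}{\epsilon^2} \sum_{m=1}^\infty \mathrm{Var}[Y_m] \frac{C}{m^2} \\ &\leq C \sum_{m=1}^\infty \frac{1}{m^2} E[Y_m^2]. \end{aligned} $$

In (1) we used that with $k_n = [\alpha^n]$ one has $\sum_{n: k_n \geq m} k_n^{-2} \leq C \cdot m^{-2}$.

Let us take a breath and continue where we have just left off:

$$ \begin{aligned} \sum_{n=1}^\infty P[|T_{k_n} - E[T_{k_n}]| \geq \epsilon] &\leq C \sum_{m=1}^\infty \frac{1}{m^2} E[Y_m^2] \\ &\leq C \sum_{m=1}^\infty \frac{1}{m^2} \sum_{l=0}^{m-1} \int_l^{l+1} x^2 \, d\mu(x) \\ &= C \sum_{l=0}^\infty \sum_{m=l+1}^\infty \frac{1}{m^2} \int_l^{l+1} x^2 \, d\mu(x) \\ &\leq C \sum_{l=0}^\infty \sum_{m=l+1}^\infty \frac{l+1}{m^2} \int_l^{l+1} x \, d\mu(x) \\ &\leq_{(2)} C \sum_{l=0}^\infty \int_l^{l+1} x \, d\mu(x) \\ &\leq C \cdot E[X_1] < \infty. \end{aligned} $$

In (2) we used that $\sum_{m=l+1}^\infty m^{-2} \leq C \cdot (l+1)^{-1}$.

We have now proved complete (= fast stochastic) convergence. This implies the almost everywhere convergence of $T_{k_n} - E[T_{k_n}] \to 0$.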


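(iii) So far, the convergence has only been verified along a subsequence $k_n$. Because we assumed $X_n \geq 0$, the sequence $U_n = \sum_{i=1}^n Y_i = n T_n$ is monotonically increasing. For $n \in [k_m, k_{m+1}]$, we get therefore

$$ \frac{k_m}{k_{m+1}} T_{k_m} = \frac{U_{k_m}}{k_{m+1}} \leq \frac{U_n}{n} = T_n \leq \frac{U_{k_{m+1}}}{k_m} = \frac{k_{m+1}}{k_m} T_{k_{m+1}} $$

and from $\lim_{m \to \infty} T_{k_m} = E[X_1]$ almost everywhere,

$$ \frac{1}{\alpha} E[X_1] \leq \liminf_n T_n \leq \limsup_n T_n \leq \alpha E[X_1] $$

follows. Since $\alpha > 1$ was arbitrary, $T_n \to E[X_1]$ almost everywhere. □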

Remark. The strong law of large numbers can be interpreted as a statement about the growth of the sequence $\sum_{k=1}^n X_k$. For $E[X_1] = 0$, the convergence $\frac{1}{n} \sum_{k=1}^n X_k \to 0$ means that for all $\epsilon > 0$ there exists $m$ such that for $n > m$

$$ \left| \sum_{k=1}^n X_k \right| \leq \epsilon n. $$

This means that the trajectory $\sum_{k=1}^n X_k$ is finally contained in any arbitrary small cone. In other words, it grows slower than linear. The exact description for the growth of $\sum_{k=1}^n X_k$ is given by the law of the iterated logarithm of Khinchin which says that a sequence of IID random variables $X_n$ with $E[X_n] = m$ and $\sigma(X_n) = \sigma \neq 0$ satisfies, with $S_n = \sum_{k=1}^n X_k$,

$$ \limsup_{n \to \infty} \frac{S_n - nm}{\Lambda_n} = +1, \qquad \liminf_{n \to \infty} \frac{S_n - nm}{\Lambda_n} = -1, $$

with $\Lambda_n = \sqrt{2 \sigma^2 n \log\log n}$.


Remark. The IID assumption on the random variables can not be weakened without further restrictions. Take for example a sequence $X_n$ of random variables satisfying $P[X_n = \pm 2^n] = 1/2$. Then $E[X_n] = 0$ but even $S_n/n$ does not converge.


Exercise. Let $X_i$ be IID random variables in $\mathcal{L}^2$. Define $Y_k = \frac{1}{k} \sum_{i=1}^k X_i$. What can you say about $S_n = \frac{1}{n} \sum_{k=1}^n Y_k$?


2.10 Birkhoff’s ergodic theorem 


In this section we fix a probability space $(\Omega, \mathcal{A}, P)$ and consider sequences of random variables $X_n$ which are defined dynamically by a map $T$ on $\Omega$ by

$$ X_n(\omega) = X(T^n(\omega)), $$

where $T^n(\omega) = T(T(\dots T(\omega)))$ is the $n$'th iterate of $\omega$. This can include as a special case the situation that the random variables are independent, but it can be much more general. Similarly to martingale theory covered later in these notes, ergodic theory is not only a generalization of classical probability theory, it is a considerable extension of it, both in language and in scope.


Definition. A measurable map $T: \Omega \to \Omega$ from the probability space onto itself is called measure preserving, if $P[T^{-1}(A)] = P[A]$ for all $A \in \mathcal{A}$. The map $T$ is called ergodic if $T(A) = A$ implies $P[A] = 0$ or $P[A] = 1$. A measure preserving map $T$ is called invertible, if there exists a measurable, measure preserving inverse $T^{-1}$ of $T$. An invertible measure preserving map $T$ is also called an automorphism of the probability space.


Example. Let $\Omega = \{|z| = 1\} \subset \mathbb{C}$ be the unit circle in the complex plane with the measure $P[\mathrm{Arg}(z) \in [a, b]] = (b - a)/(2\pi)$ for $0 < a < b < 2\pi$ and the Borel $\sigma$-algebra $\mathcal{B}$. If $w = e^{2\pi i \alpha}$ is a complex number of length 1, then the rotation $T(z) = wz$ defines a measure preserving transformation on $(\Omega, \mathcal{B}, P)$. It is invertible with inverse $T^{-1}(z) = z/w$.


Example. The transformation $T(z) = z^2$ on the same probability space as in the previous example is also measure preserving. Note that $P[T(A)] = 2P[A]$ for a short arc $A$ but $P[T^{-1}(A)] = P[A]$ for all $A \in \mathcal{B}$. The map is measure preserving but it is not invertible.


Remark. $T$ is ergodic if and only if for any $X \in \mathcal{L}^2$ the condition $X(T) = X$ implies that $X$ is constant almost everywhere.


Example. The rotation on the circle is ergodic if $\alpha$ is irrational. Proof: with $z = e^{2\pi i x}$ one can write a random variable $X$ on $\Omega$ as a Fourier series $f(z) = \sum_{n=-\infty}^{\infty} a_n z^n$ which is the sum $f_0 + f_+ + f_-$, where $f_+ = \sum_{n=1}^{\infty} a_n z^n$ is analytic in $|z| < 1$ and $f_- = \sum_{n=1}^{\infty} a_{-n} z^{-n}$ is analytic in $|z| > 1$ and $f_0$ is constant. By doing the same decomposition for $f(T(z)) = \sum_{n=-\infty}^{\infty} a_n w^n z^n$, we see that $f_+ = \sum_{n=1}^{\infty} a_n z^n = \sum_{n=1}^{\infty} a_n w^n z^n$. But these are the Taylor expansions of $f_+ = f_+(T)$ and so $a_n = a_n w^n$. Because $w^n \neq 1$ for irrational $\alpha$, we deduce $a_n = 0$ for $n \geq 1$. Similarly, one derives $a_n = 0$ for $n \leq -1$. Therefore $f(z) = a_0$ is constant.


Example. Also the non-invertible squaring transformation $T(z) = z^2$ on the circle is ergodic as a Fourier argument shows again: $T$ preserves again the decomposition of $f$ into three analytic functions $f = f_- + f_0 + f_+$ so that $f(T(z)) = \sum_{n=-\infty}^{\infty} a_n z^{2n} = \sum_{n=-\infty}^{\infty} a_n z^n$ implies $\sum_{n=1}^{\infty} a_n z^{2n} = \sum_{n=1}^{\infty} a_n z^n$. Comparing Taylor coefficients of this identity for analytic functions shows $a_n = 0$ for odd $n$ because the left hand side has zero Taylor coefficients for odd powers of $z$. But because for even $n = 2^l k$ with odd $k$, we have $a_n = a_{2^l k} = a_{2^{l-1} k} = \cdots = a_k = 0$, all coefficients $a_k = 0$ for $k \geq 1$. Similarly, one sees $a_k = 0$ for $k \leq -1$.


Definition. Given a random variable $X \in \mathcal{L}$, one obtains a sequence of random variables $X_n = X(T^n) \in \mathcal{L}$ by $X(T^n)(\omega) = X(T^n \omega)$. Define $S_0 = 0$ and $S_n = \sum_{k=0}^{n-1} X(T^k)$.


Theorem 2.10.1 (Maximal ergodic theorem of Hopf). Given $X \in \mathcal{L}^1$, the event $A = \{\sup_n S_n > 0\}$ satisfies

$$ E[X; A] = E[1_A X] \geq 0. $$


Proof. Define $Z_n = \max_{0 \leq k \leq n} S_k$ and the sets $A_n = \{Z_n > 0\} \subset A_{n+1}$. Then $A = \bigcup_n A_n$. Clearly $Z_n \in \mathcal{L}^1$. For $0 \leq k \leq n$, we have $Z_n \geq S_k$ and so $Z_n(T) \geq S_k(T)$ and hence

$$ Z_n(T) + X \geq S_{k+1}. $$

By taking the maxima on both sides over $0 \leq k \leq n$, we get

$$ Z_n(T) + X \geq \max_{1 \leq k \leq n+1} S_k. $$

On $A_n = \{Z_n > 0\}$, we can extend this to $Z_n(T) + X \geq \max_{1 \leq k \leq n+1} S_k \geq \max_{0 \leq k \leq n+1} S_k = Z_{n+1} \geq Z_n$ so that on $A_n$

$$ X \geq Z_n - Z_n(T). $$

Integration over the set $A_n$ gives

$$ E[X; A_n] \geq E[Z_n; A_n] - E[Z_n(T); A_n]. $$

Using (1) this inequality, the fact (2) that $Z_n = 0$ on $\Omega \setminus A_n$, the (3) inequality $Z_n(T) \geq S_0(T) = 0$ and finally that $T$ is measure preserving (4), leads to

$$ \begin{aligned} E[X; A_n] &\geq_{(1)} E[Z_n; A_n] - E[Z_n(T); A_n] \\ &=_{(2)} E[Z_n] - E[Z_n(T); A_n] \\ &\geq_{(3)} E[Z_n - Z_n(T)] =_{(4)} 0 \end{aligned} $$

for every $n$ and so to $E[X; A] \geq 0$. □


Theorem 2.10.2 (Ergodic theorem of Birkhoff, 1931). For any $X \in \mathcal{L}^1$ the time average

$$ \frac{S_n}{n} = \frac{1}{n} \sum_{i=0}^{n-1} X(T^i) $$

converges almost everywhere to a $T$-invariant random variable $\overline{X}$ satisfying $E[\overline{X}] = E[X]$. Especially, if $T$ is ergodic, then $S_n/n$ converges to $E[X]$.


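Proof. Define $\overline{X} = \limsup_{n \to \infty} S_n/n$ and $\underline{X} = \liminf_{n \to \infty} S_n/n$. We get $\overline{X} = \overline{X}(T)$ and $\underline{X} = \underline{X}(T)$ because

$$ \frac{n+1}{n} \frac{S_{n+1}}{n+1} - \frac{S_n(T)}{n} = \frac{X}{n}. $$

(i) $\overline{X} = \underline{X}$.

Define for $\beta < \alpha \in \mathbb{R}$ the sets $A_{\alpha,\beta} = \{\underline{X} < \beta, \alpha < \overline{X}\}$. Because $\{\underline{X} < \overline{X}\} = \bigcup_{\beta < \alpha, \, \alpha, \beta \in \mathbb{Q}} A_{\alpha,\beta}$, it is enough to show that $P[A_{\alpha,\beta}] = 0$ for rational $\beta < \alpha$. Define

$$ A = \{\sup_n (S_n - n\alpha) > 0\} = \{\sup_n (S_n/n - \alpha) > 0\}. $$

Because $A_{\alpha,\beta} \subset A$ and $A_{\alpha,\beta}$ is $T$-invariant, we get from the maximal ergodic theorem $E[X - \alpha; A_{\alpha,\beta}] \geq 0$ and so

$$ E[X; A_{\alpha,\beta}] \geq \alpha \cdot P[A_{\alpha,\beta}]. $$

Replacing $X, \alpha, \beta$ with $-X, -\beta, -\alpha$ and using $\overline{-X} = -\underline{X}$, $\underline{-X} = -\overline{X}$ gives $E[X; A_{\alpha,\beta}] \leq \beta \cdot P[A_{\alpha,\beta}]$ and because $\beta < \alpha$, the claim $P[A_{\alpha,\beta}] = 0$ follows.

(ii) $\overline{X} \in \mathcal{L}^1$.

Since $T$ is measure preserving, $E[|S_n/n|] \leq E[|X|]$ for all $n$. By (i), $S_n/n$ converges pointwise almost everywhere to $\overline{X} = \underline{X}$, so that Fatou's lemma gives $E[|\overline{X}|] \leq \liminf_{n \to \infty} E[|S_n/n|] \leq E[|X|] < \infty$.

(iii) $E[\overline{X}] = E[X]$.

Define the sets $B_{k,n} = \{\overline{X} \in [\frac{k}{n}, \frac{k+1}{n})\}$ for $k \in \mathbb{Z}$, $n \geq 1$. Define for $\epsilon > 0$, $Y = X - \frac{k}{n} + \epsilon$. Using the maximal ergodic theorem, we get $E[Y; B_{k,n}] \geq 0$. Because $\epsilon > 0$ was arbitrary,

$$ E[X; B_{k,n}] \geq \frac{k}{n} P[B_{k,n}]. $$

With this inequality

$$ E[\overline{X}; B_{k,n}] \leq \frac{k+1}{n} P[B_{k,n}] \leq \frac{1}{n} P[B_{k,n}] + E[X; B_{k,n}]. $$

Summing over $k$ gives

$$ E[\overline{X}] \leq \frac{1}{n} + E[X] $$

and because $n$ was arbitrary, $E[\overline{X}] \leq E[X]$. Doing the same with $-X$ and using (i), we end with

$$ E[-\overline{X}] = E[-\underline{X}] = E[\overline{-X}] \leq E[-X], $$

which gives $E[X] \leq E[\overline{X}]$ and therefore $E[\overline{X}] = E[X]$. □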


Corollary 2.10.3. The strong law of large numbers holds for IID random variables $X_n \in L^1$.


Proof. Given a sequence of IID random variables $X_n \in L^1$. Let $\mu$ be the law of $X_n$. Define the probability space $\Omega = (\mathbb{R}^{\mathbb{Z}}, \mathcal{A}, P)$, where $P = \mu^{\mathbb{Z}}$ is the product measure. If $T: \Omega \to \Omega$, $T(\omega)_n = \omega_{n+1}$ denotes the shift on $\Omega$, then $X_n = X(T^n)$ with $X(\omega) = \omega_0$. Since every $T$-invariant function is constant almost everywhere, we must have $\overline{X} = E[X]$ almost everywhere, so that $S_n/n \to E[X]$ almost everywhere. □


Remark. While ergodic theory is closely related to probability theory, the notation in the two fields is different. The reason is that the origins of the two theories are different. In ergodic theory, one usually writes $(X, \mathcal{A}, m)$ for a probability space. Another example of the different language is that ergodic theorists do not use the word "random variables" $X$ but speak of "functions" $f$. Good introductions to ergodic theory are [36, 13, 8, 77, 54, 107].
2.11 More convergence results

We mention now some results about the almost everywhere convergence of sums of random variables, in contrast to the weak and strong laws, which were dealing with averaged sums.


Theorem 2.11.1 (Kolmogorov's inequalities). a) Assume $X_k \in L^2$ are independent random variables. Then

$$ P[\sup_{1 \leq k \leq n} |S_k - E[S_k]| \geq \epsilon] \leq \frac{1}{\epsilon^2} Var[S_n] . $$

b) Assume $X_k \in L^\infty$ are independent random variables and $\|X_n\|_\infty \leq R$. Then

$$ P[\sup_{1 \leq k \leq n} |S_k - E[S_k]| \geq \epsilon] \geq 1 - \frac{(R+\epsilon)^2}{\sum_{k=1}^n Var[X_k]} . $$


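Proof. We can assume $E[X_k] = 0$ without loss of generality.
a) For $1 \leq k \leq n$ we have

$$ S_n^2 - S_k^2 = (S_n - S_k)^2 + 2(S_n - S_k) S_k \geq 2(S_n - S_k) S_k $$

and therefore $E[S_n^2; A_k] \geq E[S_k^2; A_k]$ for all $A_k \in \sigma(X_1, \dots, X_k)$ by the independence of $S_n - S_k$ and $S_k$. The sets $A_1 = \{|S_1| \geq \epsilon\}$, $A_{k+1} = \{|S_{k+1}| \geq \epsilon, \max_{1 \leq l \leq k} |S_l| < \epsilon\}$ are mutually disjoint. We have to estimate the probability of the events

$$ B_n = \{\max_{1 \leq k \leq n} |S_k| \geq \epsilon\} = \bigcup_{k=1}^n A_k . $$

We get

$$ E[S_n^2] \geq E[S_n^2; B_n] = \sum_{k=1}^n E[S_n^2; A_k] \geq \sum_{k=1}^n E[S_k^2; A_k] \geq \epsilon^2 \sum_{k=1}^n P[A_k] = \epsilon^2 P[B_n] . $$

b)

$$ E[S_n^2; B_n] = E[S_n^2] - E[S_n^2; B_n^c] \geq E[S_n^2] - \epsilon^2 (1 - P[B_n]) . $$

On $A_k$, $|S_{k-1}| < \epsilon$ and $|S_k| \leq |S_{k-1}| + |X_k| \leq \epsilon + R$ holds. We use that in the estimate

$$ E[S_n^2; B_n] = \sum_{k=1}^n E[S_k^2 + (S_n - S_k)^2; A_k] $$
$$ = \sum_{k=1}^n E[S_k^2; A_k] + \sum_{k=1}^n E[(S_n - S_k)^2; A_k] $$
$$ \leq (R+\epsilon)^2 \sum_{k=1}^n P[A_k] + \sum_{k=1}^n P[A_k] \sum_{j=k+1}^n Var[X_j] $$
$$ \leq P[B_n] ((\epsilon + R)^2 + E[S_n^2]) $$

so that

$$ E[S_n^2] \leq P[B_n] ((\epsilon + R)^2 + E[S_n^2]) + \epsilon^2 - \epsilon^2 P[B_n] $$

and so

$$ P[B_n] \geq \frac{E[S_n^2] - \epsilon^2}{(\epsilon+R)^2 + E[S_n^2] - \epsilon^2} = 1 - \frac{(\epsilon+R)^2}{(\epsilon+R)^2 + E[S_n^2] - \epsilon^2} \geq 1 - \frac{(\epsilon+R)^2}{E[S_n^2]} . $$

□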
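
Inequality a) can be checked by simulation. The sketch below is an illustration under assumptions not made in the text: the steps $X_k$ are fair signs $\pm 1$ (so $E[S_k] = 0$ and $Var[S_n] = n$), and the helper names `max_abs_partial_sum` and `estimate` are hypothetical.

```python
import random


def max_abs_partial_sum(n, rng):
    """max_{1<=k<=n} |S_k| for a walk with independent fair steps +1/-1."""
    s, m = 0, 0
    for _ in range(n):
        s += rng.choice((-1, 1))
        m = max(m, abs(s))
    return m


def estimate(n, eps, trials, rng):
    """Monte Carlo estimate of P[max_k |S_k| >= eps]."""
    hits = sum(max_abs_partial_sum(n, rng) >= eps for _ in range(trials))
    return hits / trials


rng = random.Random(0)
n, eps = 100, 25
p_hat = estimate(n, eps, 2_000, rng)
bound = n / eps ** 2  # Var[S_n] / eps^2 with Var[S_n] = n
print(p_hat, bound)   # the estimate should not exceed the bound
```

The bound is far from sharp in this example, which is typical: Kolmogorov's inequality only uses second moments.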


Remark. The inequalities remain true in the limit $n \to \infty$. The first inequality is then

$$ P[\sup_k |S_k - E[S_k]| \geq \epsilon] \leq \frac{1}{\epsilon^2} \sum_{k=1}^\infty Var[X_k] . $$

Of course, the statement in a) is void if the right hand side is infinite. In this case, however, the inequality in b) states that $\sup_k |S_k - E[S_k]| \geq \epsilon$ almost surely for every $\epsilon > 0$.


Remark. For $n = 1$, Kolmogorov's inequality reduces to Chebychev's inequality (2.5.5).


Lemma 2.11.2. A sequence $X_n$ of random variables converges almost everywhere if and only if

$$ \lim_{n \to \infty} P[\sup_{k \geq 1} |X_{n+k} - X_n| > \epsilon] = 0 $$

for all $\epsilon > 0$.


Proof. This is an exercise. □
Theorem 2.11.3 (Kolmogorov). Assume $X_n \in L^2$ are independent and $\sum_{n=1}^\infty Var[X_n] < \infty$. Then

$$ \sum_{n=1}^\infty (X_n - E[X_n]) $$

converges almost everywhere.


Proof. Define $Y_n = X_n - E[X_n]$ and $S_n = \sum_{k=1}^n Y_k$. Given $m \in \mathbb{N}$. Apply Kolmogorov's inequality to the sequence $Y_{m+k}$ to get

$$ P[\sup_{n \geq m} |S_n - S_m| \geq \epsilon] \leq \frac{1}{\epsilon^2} \sum_{k=m+1}^\infty E[Y_k^2] \to 0 $$

for $m \to \infty$. The above lemma implies that $S_n(\omega)$ converges. □


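Figure. We sum up independent random variables $X_k$ which take values $\pm \frac{1}{k^\alpha}$ with equal probability. According to theorem (2.11.3), the process

$$ S_n = \sum_{k=1}^n (X_k - E[X_k]) = \sum_{k=1}^n X_k $$

converges if

$$ \sum_{k=1}^\infty E[X_k^2] = \sum_{k=1}^\infty \frac{1}{k^{2\alpha}} $$

converges. This is the case if $\alpha > 1/2$. The picture shows some experiments in the case $\alpha = 0.6$.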
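
An experiment of this kind can be sketched as follows. This is only an illustration: the function names `random_series` and `tail_variance`, the seed and the sample sizes are assumptions, not taken from the notes.

```python
import random


def random_series(alpha, n, rng):
    """Partial sums S_1, ..., S_n of sum_k X_k with X_k = +-1/k**alpha, fair independent signs."""
    sums, s = [], 0.0
    for k in range(1, n + 1):
        s += rng.choice((-1.0, 1.0)) / k ** alpha
        sums.append(s)
    return sums


def tail_variance(alpha, m, n):
    """Var[S_n - S_m] = sum_{k=m+1}^{n} k**(-2*alpha)."""
    return sum(k ** (-2.0 * alpha) for k in range(m + 1, n + 1))


rng = random.Random(0)
path = random_series(0.6, 20_000, rng)
# Compare two late partial sums with the variance of their difference.
print(path[9_999], path[19_999], tail_variance(0.6, 10_000, 20_000))
```

For $\alpha = 0.6$ the tail variance decays only like $m^{-0.2}$, so the convergence guaranteed by the theorem is slow; for larger $\alpha$ the paths settle much faster.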


The following theorem gives a necessary and sufficient condition that a sum $S_n = \sum_{k=1}^n X_k$ converges for a sequence $X_n$ of independent random variables.


Definition. Given $R \in \mathbb{R}$ and a random variable $X$, we define the bounded random variable

$$ X^{(R)} = 1_{|X| < R} X . $$
Theorem 2.11.4 (Three series theorem). Assume $X_n \in \mathcal{L}$ are independent. Then $\sum_{n=1}^\infty X_n$ converges almost everywhere if and only if for some $R > 0$ all of the following three series converge:

$$ \sum_{k=1}^\infty P[|X_k| > R] < \infty , \qquad (2.4) $$

$$ \sum_{k=1}^\infty |E[X_k^{(R)}]| < \infty , \qquad (2.5) $$

$$ \sum_{k=1}^\infty Var[X_k^{(R)}] < \infty . \qquad (2.6) $$


Proof. "⇒" Assume first that the three series all converge. By (2.6) and Kolmogorov's theorem, we know that $\sum_{k=1}^\infty (X_k^{(R)} - E[X_k^{(R)}])$ converges almost surely. Therefore, by (2.5), $\sum_{k=1}^\infty X_k^{(R)}$ converges almost surely. By (2.4) and Borel-Cantelli, $P[X_k \neq X_k^{(R)} \text{ infinitely often}] = 0$. Since for almost all $\omega$, $X_k^{(R)}(\omega) = X_k(\omega)$ for sufficiently large $k$ and for almost all $\omega$, $\sum_{k=1}^\infty X_k^{(R)}(\omega)$ converges, we get a set of measure one, where $\sum_{k=1}^\infty X_k$ converges.

"⇐" Assume now that $\sum_{n=1}^\infty X_n$ converges almost everywhere. Then $X_k \to 0$ almost everywhere and $P[|X_k| > R \text{ infinitely often}] = 0$ for every $R > 0$. By the second Borel-Cantelli lemma, the sum (2.4) converges.

The almost sure convergence of $\sum_{n=1}^\infty X_n$ implies the almost sure convergence of $\sum_{n=1}^\infty X_n^{(R)}$ since $P[|X_k| > R \text{ infinitely often}] = 0$.

Let $R > 0$ be fixed. Let $Y_k$ be a sequence of independent random variables such that $Y_k$ and $X_k^{(R)}$ have the same distribution and that all the random variables $X_k^{(R)}, Y_k$ are independent. The almost sure convergence of $\sum_{n=1}^\infty X_n^{(R)}$ implies that of $\sum_{n=1}^\infty (X_n^{(R)} - Y_n)$. Since $E[X_k^{(R)} - Y_k] = 0$ and $P[|X_k^{(R)} - Y_k| \leq 2R] = 1$, by Kolmogorov inequality b), the series $T_n = \sum_{k=1}^n (X_k^{(R)} - Y_k)$ satisfies for all $\epsilon > 0$

$$ P[\sup_{k \geq 1} |T_{n+k} - T_n| > \epsilon] \geq 1 - \frac{(R+\epsilon)^2}{\sum_{k=n}^\infty Var[X_k^{(R)} - Y_k]} . $$

Claim: $\sum_{k=1}^\infty Var[X_k^{(R)} - Y_k] < \infty$.
Assume the sum is infinite. Then the above inequality gives $P[\sup_{k \geq 0} |T_{n+k} - T_n| \geq \epsilon] = 1$. But this contradicts the almost sure convergence of $\sum_{k=1}^\infty (X_k^{(R)} - Y_k)$ because the latter implies by Kolmogorov inequality that $P[\sup_{k \geq 1} |S_{n+k} - S_n| > \epsilon] < 1/2$ for large enough $n$. Having shown that $\sum_{k=1}^\infty Var[X_k^{(R)} - Y_k] < \infty$, we are done because then by Kolmogorov's theorem (2.11.3), the sum $\sum_{k=1}^\infty (X_k^{(R)} - E[X_k^{(R)}])$ converges, so that (2.5) holds. □




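Figure. A special case of the three series theorem is when $X_k$ are uniformly bounded $|X_k| \leq R$ and have zero expectation $E[X_k] = 0$. In that case, almost everywhere convergence of $S_n = \sum_{k=1}^n X_k$ is equivalent to the convergence of $\sum_{k=1}^\infty Var[X_k]$. For example, in the case

$$ X_k = \begin{cases} \frac{1}{k^\alpha} & \text{with probability } 1/2 \\ -\frac{1}{k^\alpha} & \text{with probability } 1/2 \end{cases} $$

and $\alpha = 1/2$, we do not have almost everywhere convergence of $S_n$, because $\sum_{k=1}^\infty Var[X_k] = \sum_{k=1}^\infty \frac{1}{k} = \infty$.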

Definition. A real number $\alpha \in \mathbb{R}$ is called a median of $X \in \mathcal{L}$ if $P[X \leq \alpha] \geq 1/2$ and $P[X \geq \alpha] \geq 1/2$. We denote by med($X$) the set of medians of $X$.


Remark. The median is not unique and in general different from the mean. 
It is also defined for random variables for which the mean does not exist. 


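Proposition 2.11.5. (Comparing median and mean) Let $Y \in L^2$. Then every $\alpha \in$ med($Y$) satisfies

$$ |\alpha - E[Y]| \leq \sqrt{2} \sigma[Y] . $$


Proof. For every $\beta \in \mathbb{R}$, one has

$$ \frac{|\alpha - \beta|^2}{2} \leq |\alpha - \beta|^2 \min(P[Y \geq \alpha], P[Y \leq \alpha]) \leq E[(Y - \beta)^2] . $$

Now put $\beta = E[Y]$. □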


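Theorem 2.11.6 (Lévy). Given a sequence $X_n \in \mathcal{L}$ of independent random variables. Choose $\alpha_{l,k} \in$ med($S_l - S_k$). Then, for all $n \in \mathbb{N}$ and all $\epsilon > 0$

$$ P[\max_{1 \leq k \leq n} |S_k + \alpha_{n,k}| \geq \epsilon] \leq 2 P[|S_n| \geq \epsilon] . $$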
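Proof. Fix $n \in \mathbb{N}$ and $\epsilon > 0$. The sets

$$ A_1 = \{S_1 + \alpha_{n,1} \geq \epsilon\}, \quad A_{k+1} = \{\max_{1 \leq l \leq k} (S_l + \alpha_{n,l}) < \epsilon, \; S_{k+1} + \alpha_{n,k+1} \geq \epsilon\} $$

for $1 \leq k < n$ are disjoint and $\bigcup_{k=1}^n A_k = \{\max_{1 \leq k \leq n} (S_k + \alpha_{n,k}) \geq \epsilon\}$. Because $\{S_n \geq \epsilon\}$ contains all the sets $A_k \cap \{S_n - S_k \geq \alpha_{n,k}\}$ for $1 \leq k \leq n$, we have using the independence of $\sigma(A_k)$ and $\sigma(S_n - S_k)$

$$ P[S_n \geq \epsilon] \geq \sum_{k=1}^n P[\{S_n - S_k \geq \alpha_{n,k}\} \cap A_k] = \sum_{k=1}^n P[S_n - S_k \geq \alpha_{n,k}] P[A_k] \geq \frac{1}{2} \sum_{k=1}^n P[A_k] = \frac{1}{2} P[\max_{1 \leq k \leq n} (S_k + \alpha_{n,k}) \geq \epsilon] . $$

Applying this inequality to $-X_n$, for which $-\alpha_{n,k}$ is a median of $-(S_n - S_k)$, we get also $P[\max_{1 \leq k \leq n} (-S_k - \alpha_{n,k}) \geq \epsilon] \leq 2 P[-S_n \geq \epsilon]$ and so

$$ P[\max_{1 \leq k \leq n} |S_k + \alpha_{n,k}| \geq \epsilon] \leq 2 P[|S_n| \geq \epsilon] . $$

□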


Corollary 2.11.7. (Lévy) Given a sequence $X_n \in \mathcal{L}$ of independent random variables. If the partial sums $S_n$ converge in probability to $S$, then $S_n$ converges almost everywhere to $S$.


Proof. Take $\alpha_{l,k} \in$ med($S_l - S_k$). Since $S_n$ converges in probability, there exists $m_1 \in \mathbb{N}$ such that $|\alpha_{l,k}| \leq \epsilon/2$ for all $m_1 \leq k \leq l$. In addition, there exists $m_2 \in \mathbb{N}$ such that $\sup_{n \geq 1} P[|S_{n+m} - S_m| \geq \epsilon/2] < \epsilon/2$ for all $m \geq m_2$. For $m = \max\{m_1, m_2\}$, we have for $n \geq 1$

$$ P[\max_{1 \leq l \leq n} |S_{l+m} - S_m| \geq \epsilon] \leq P[\max_{1 \leq l \leq n} |S_{l+m} - S_m + \alpha_{n+m,l+m}| \geq \epsilon/2] . $$

The right hand side can be estimated by theorem (2.11.6) applied to $X_{n+m}$ with

$$ \leq 2 P[|S_{n+m} - S_m| \geq \epsilon/2] < \epsilon . $$

Now apply the convergence lemma (2.11.2). □
Exercise. Prove the strong law of large numbers for independent but not necessarily identically distributed random variables: Given a sequence of independent random variables $X_n \in L^2$ satisfying $E[X_n] = m$. If

$$ \sum_{k=1}^\infty Var[X_k]/k^2 < \infty , $$

then $S_n/n \to m$ almost everywhere.
Hint: Use Kolmogorov's theorem for $Y_k = X_k/k$.


Exercise. Let $X_n$ be an IID sequence of random variables with uniform distribution on $[0,1]$. Prove that almost surely

$$ \sum_{n=1}^\infty \prod_{i=1}^n X_i < \infty . $$

Hint: Use $Var[\prod_i X_i] = \prod_i E[X_i^2] - \prod_i E[X_i]^2$ and use the three series theorem.


2.12 Classes of random variables 


The probability distribution function $F_X : \mathbb{R} \to [0,1]$ of a random variable $X$ was defined as

$$ F_X(x) = P[X \leq x] , $$

where $P[X \leq x]$ is a short hand notation for $P[\{\omega \in \Omega \mid X(\omega) \leq x\}]$. The law $\mu_X = X^*P$ of $X$ on $\mathbb{R}$ satisfies $F_X(x) = \int_{-\infty}^x d\mu_X(y)$, so that $F_X$ is the anti-derivative of $\mu_X$. One reason to introduce distribution functions is that one can replace integrals on the probability space $\Omega$ by integrals on the real line $\mathbb{R}$, which is more convenient.


Remark. The distribution function $F_X$ determines the law $\mu_X$ because the measure $\nu((-\infty, a]) = F_X(a)$ on the $\pi$-system $\mathcal{I}$ given by the intervals $\{(-\infty, a]\}$ determines a unique measure on $\mathbb{R}$. Of course, the distribution function does not determine the random variable itself. There are many different random variables defined on different probability spaces, which have the same distribution.
Proposition 2.12.1. The distribution function $F_X$ of a random variable is

a) non-decreasing,
b) $F_X(-\infty) = 0$, $F_X(\infty) = 1$,
c) continuous from the right: $F_X(x+h) \to F_X(x)$ for $h \downarrow 0$.

Furthermore, given a function $F$ with the properties a), b), c), there exists a random variable $X$ on a probability space $(\Omega, \mathcal{A}, P)$ which satisfies $F_X = F$.


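Proof. a) follows from $\{X \leq x\} \subset \{X \leq y\}$ for $x \leq y$. b) $P[\{X \leq -n\}] \to 0$ and $P[\{X \leq n\}] \to 1$. c) $F_X(x+h) - F_X(x) = P[x < X \leq x+h] \to 0$ for $h \downarrow 0$.

Given $F$, define $\Omega = \mathbb{R}$ and $\mathcal{A}$ as the Borel $\sigma$-algebra on $\mathbb{R}$. The measure $P[(-\infty, a]] = F(a)$ on the $\pi$-system $\mathcal{I}$ defines a unique measure on $(\Omega, \mathcal{A})$. The random variable $X(\omega) = \omega$ on this probability space satisfies $F_X = F$. □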


Remark. Every Borel probability measure $\mu$ on $\mathbb{R}$ determines a distribution function $F_X$ of some random variable $X$ by

$$ F(x) = \int_{-\infty}^x d\mu(y) = \mu((-\infty, x]) . $$

The proposition tells also that one can define a class of distribution functions, the set of real functions $F$ which satisfy properties a), b), c).


Example. Bertrand's paradox mentioned in the introduction shows that the choice of the distribution function is important. In each of the three cases, the midpoint of the random line has a radially symmetric density $f(x,y)$ on the unit disc. The constant density $f(x,y) = 1/\pi$ is obtained when we throw the center of the line into the disc. The disc $A_r$ of radius $r$ then has probability $P[A_r] = r^2$, and the density in the $r$ direction is $2r$. The density $f(x,y) = 1/(2\pi r) = 1/(2\pi\sqrt{x^2+y^2})$ is obtained when throwing parallel lines. This puts more weight near the center: the probability $P[A_r] = r$ is bigger than the normalized area $r^2$ of the disc, and the radial density is $1$. Finally, when we rotate the line around a point on the boundary, the disc $A_r$ of radius $r$ has probability $(2/\pi) \arcsin(r)$ and the density in the $r$ direction is $(2/\pi)/\sqrt{1 - r^2}$.
Figure. A plot of the radial density function $f(r)$ for the three different interpretations of the Bertrand paradox.

Figure. A plot of the radial distribution function $F(r) = P[A_r]$. There are different values at $F(1/2)$.


So, what happens, if we really do an experiment and throw randomly lines 
onto a disc? The punch line of the story is that the outcome of the ex- 
periment very much depends on how the experiment will be performed. If 
we would do the experiment by hand, we would probably try to throw the 
center of the stick into the middle of the disc. Since we would aim to the 
center, the distribution would be different from any of the three solutions 
given in Bertrand’s paradox. 
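
The dependence on the experiment can be made concrete by simulation. The following sketch is an illustration only: it samples the distance of the random line from the center of the unit disc under the three classical interpretations and estimates $F(1/2)$. The method labels `"center"`, `"parallel"`, `"boundary"` and the function names are hypothetical.

```python
import math
import random


def chord_distance(method, rng):
    """Distance from the disc center to a random chord of the unit disc."""
    u = rng.random()
    if method == "center":      # midpoint of the chord uniform in the disc
        return math.sqrt(u)
    if method == "parallel":    # parallel lines: distance uniform in [0, 1]
        return u
    if method == "boundary":    # rotate the line around a boundary point
        return abs(math.cos(math.pi * u))
    raise ValueError("unknown method: " + method)


def radial_distribution(method, r, trials=100_000, seed=0):
    """Monte Carlo estimate of F(r) = P[distance <= r]."""
    rng = random.Random(seed)
    return sum(chord_distance(method, rng) <= r for _ in range(trials)) / trials


for method in ("center", "parallel", "boundary"):
    print(method, radial_distribution(method, 0.5))
```

The three estimates should be close to the classical answers $1/4$, $1/2$ and $1/3$ of Bertrand's paradox, since a chord is longer than $\sqrt{3}$ exactly when its distance from the center is smaller than $1/2$.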


Definition. A distribution function $F$ is called absolutely continuous (ac), if there exists a Borel measurable function $f$ satisfying $F(x) = \int_{-\infty}^x f(y) \, dy$. One calls a random variable with an absolutely continuous distribution function a continuous random variable.


Definition. A distribution function is called pure point (pp) or atomic if there exists a countable sequence of real numbers $x_n$ and a sequence of positive numbers $p_n$ with $\sum_n p_n = 1$ such that $F(x) = \sum_{n, x_n \leq x} p_n$. One calls a random variable with a discrete distribution function a discrete random variable.


Definition. A distribution function $F$ is called singular continuous (sc) if $F$ is continuous and if there exists a Borel set $S$ of zero Lebesgue measure such that $\mu_F(S) = 1$. One calls a random variable with a singular continuous distribution function a singular continuous random variable.


Remark. The definition of (ac), (pp) and (sc) distribution functions is compatible with the definition of (ac), (pp) and (sc) Borel measures on $\mathbb{R}$. A Borel measure is (pp), if $\mu(A) = \sum_{a \in A} \mu(\{a\})$. It is continuous, if it contains no atoms, points with positive measure. It is (ac), if there exists a measurable function $f$ such that $\mu = f \, dx$. It is (sc), if it is continuous and if $\mu(S) = 1$ for some Borel set $S$ of zero Lebesgue measure.


The following decomposition theorem shows that these three classes are 
natural: 


Theorem 2.12.2 (Lebesgue decomposition theorem). Every Borel measure $\mu$ on $(\mathbb{R}, \mathcal{B})$ can be decomposed in a unique way as $\mu = \mu_{pp} + \mu_{ac} + \mu_{sc}$, where $\mu_{pp}$ is pure point, $\mu_{sc}$ is singular continuous and $\mu_{ac}$ is absolutely continuous with respect to the Lebesgue measure $\lambda$.


Proof. Denote by $\lambda$ the Lebesgue measure on $(\mathbb{R}, \mathcal{B})$ for which $\lambda([a,b]) = b - a$. We first show that any measure $\mu$ can be decomposed as $\mu = \mu_{ac} + \mu_s$, where $\mu_{ac}$ is absolutely continuous with respect to $\lambda$ and $\mu_s$ is singular. The decomposition is unique: $\mu = \mu_{ac}^{(1)} + \mu_s^{(1)} = \mu_{ac}^{(2)} + \mu_s^{(2)}$ implies that $\mu_{ac}^{(1)} - \mu_{ac}^{(2)} = \mu_s^{(2)} - \mu_s^{(1)}$ is both absolutely continuous and singular with respect to $\lambda$, which is only possible if both sides are zero. To get the existence of the decomposition, define $c = \sup_{A \in \mathcal{B}, \lambda(A) = 0} \mu(A)$. If $c = 0$, then $\mu$ is absolutely continuous and we are done. If $c > 0$, take an increasing sequence $A_n \in \mathcal{B}$ with $\lambda(A_n) = 0$ and $\mu(A_n) \to c$. Define $A = \bigcup_{n \geq 1} A_n$, so that $\lambda(A) = 0$ and $\mu(A) = c$, and set $\mu_s(B) = \mu(A \cap B)$ and $\mu_{ac}(B) = \mu(B \setminus A)$. To split the singular part $\mu_s$ into a singular continuous and pure point part, we again have uniqueness because $\mu_s = \mu_{sc}^{(1)} + \mu_{pp}^{(1)} = \mu_{sc}^{(2)} + \mu_{pp}^{(2)}$ implies that $\nu = \mu_{sc}^{(1)} - \mu_{sc}^{(2)} = \mu_{pp}^{(2)} - \mu_{pp}^{(1)}$ is both singular continuous and pure point, which implies that $\nu = 0$. To get existence, define the finite or countable set $A = \{\omega \mid \mu(\{\omega\}) > 0\}$ and define $\mu_{pp}(B) = \mu(A \cap B)$. □


Definition. The Gamma function is defined for $x > 0$ as

$$ \Gamma(x) = \int_0^\infty t^{x-1} e^{-t} \, dt . $$

It satisfies $\Gamma(n) = (n-1)!$ for $n \in \mathbb{N}$. Define also

$$ B(p,q) = \int_0^1 x^{p-1} (1-x)^{q-1} \, dx , $$

the Beta function.
Here are some examples of absolutely continuous distributions:


ac1) The normal distribution $N(m, \sigma^2)$ on $\Omega = \mathbb{R}$ has the probability density function

$$ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-m)^2}{2\sigma^2}} . $$

ac2) The Cauchy distribution on $\Omega = \mathbb{R}$ has the probability density function

$$ f(x) = \frac{1}{\pi} \frac{b}{b^2 + (x-m)^2} . $$

ac3) The uniform distribution on $\Omega = [a,b]$ has the probability density function

$$ f(x) = \frac{1}{b-a} . $$

ac4) The exponential distribution with parameter $\lambda > 0$ on $\Omega = [0, \infty)$ has the probability density function

$$ f(x) = \lambda e^{-\lambda x} . $$

ac5) The log normal distribution on $\Omega = [0, \infty)$ has the density function

$$ f(x) = \frac{1}{\sqrt{2\pi x^2 \sigma^2}} e^{-\frac{(\log(x)-m)^2}{2\sigma^2}} . $$

ac6) The beta distribution on $\Omega = [0,1]$ with $p > 1, q > 1$ has the density

$$ f(x) = \frac{x^{p-1} (1-x)^{q-1}}{B(p,q)} . $$

ac7) The Gamma distribution on $\Omega = [0, \infty)$ with parameters $\alpha > 0, \beta > 0$ has the density

$$ f(x) = \frac{x^{\alpha-1} e^{-x/\beta}}{\beta^\alpha \Gamma(\alpha)} . $$


Figure. The probability density and the CDF of the normal distribution.

Figure. The probability density and the CDF of the Cauchy distribution.

Figure. The probability density and the CDF of the uniform distribution.

Figure. The probability density and the CDF of the exponential distribution.


Definition. We use the notation

$$ \binom{n}{k} = \frac{n!}{(n-k)! \, k!} $$

for the Binomial coefficient, where $k! = k(k-1)(k-2) \cdots 2 \cdot 1$ is the factorial of $k$ with the convention $0! = 1$. For example,

$$ \binom{10}{3} = \frac{10!}{7! \, 3!} = 10 \cdot 9 \cdot 8 / 6 = 120 . $$


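Examples of discrete distributions:

pp1) The binomial distribution on $\Omega = \{0, 1, \dots, n\}$

$$ P[X = k] = \binom{n}{k} p^k (1-p)^{n-k} . $$

pp2) The Poisson distribution on $\Omega = \mathbb{N}$

$$ P[X = k] = e^{-\lambda} \frac{\lambda^k}{k!} . $$

pp3) The discrete uniform distribution on $\Omega = \{1, \dots, n\}$

$$ P[X = k] = \frac{1}{n} . $$

pp4) The geometric distribution on $\Omega = \mathbb{N} = \{0, 1, 2, 3, \dots\}$

$$ P[X = k] = p(1-p)^k . $$

pp5) The distribution of first success on $\Omega = \mathbb{N} \setminus \{0\} = \{1, 2, 3, \dots\}$

$$ P[X = k] = p(1-p)^{k-1} . $$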
Figure. The probabilities and the CDF of the binomial distribution.

Figure. The probabilities and the CDF of the Poisson distribution.

Figure. The probabilities and the CDF of the uniform distribution.

Figure. The probabilities and the CDF of the geometric distribution.


An example of a singular continuous distribution:


sc1) The Cantor distribution. Let $C = \bigcap_{n=0}^\infty E_n$ be the Cantor set, where $E_0 = [0,1]$, $E_1 = [0,1/3] \cup [2/3,1]$ and $E_n$ is inductively obtained by cutting away the middle third of each interval in $E_{n-1}$. Define

$$ F(x) = \lim_{n \to \infty} F_n(x) , $$

where $F_n(x)$ has the density $(3/2)^n \cdot 1_{E_n}$. One can realize a random variable with the Cantor distribution as a sum of IID random variables as follows:

$$ X = \sum_{n=1}^\infty \frac{X_n}{3^n} , $$

where $X_n$ take values 0 and 2 with probability 1/2 each.
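
The series representation gives a direct way to sample from the Cantor distribution. The sketch below is an illustration under the assumption that truncating the series after 40 ternary digits is accurate enough; the function name `cantor_sample`, the seed and the sample size are hypothetical choices.

```python
import random


def cantor_sample(rng, digits=40):
    """One sample of X = sum_{n>=1} X_n / 3**n, truncated after `digits` terms."""
    return sum(rng.choice((0, 2)) / 3 ** n for n in range(1, digits + 1))


rng = random.Random(0)
xs = [cantor_sample(rng) for _ in range(50_000)]
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)
print(mean, var)  # should be near the exact values 1/2 and 1/8
```

Every sample lies in $E_n$ for each $n$ up to the truncation depth, so a histogram of `xs` shows the gaps of the Cantor set.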
Figure. The CDF of the Cantor distribution is continuous but not absolutely continuous. The function $F_X(x)$ is in this case called the Cantor function. Its graph is also called a devil's staircase.


Lemma 2.12.3. Given $X \in \mathcal{L}$ with law $\mu$. For any measurable map $h : \mathbb{R} \to [0, \infty)$ for which $h(X) \in L^1$, one has $E[h(X)] = \int_{\mathbb{R}} h(x) \, d\mu(x)$. Especially, if $\mu = \mu_{ac} = f \, dx$ then

$$ E[h(X)] = \int_{\mathbb{R}} h(x) f(x) \, dx . $$

If $\mu = \mu_{pp}$, then

$$ E[h(X)] = \sum_{x, \mu(\{x\}) \neq 0} h(x) \mu(\{x\}) . $$


Proof. If the function $h$ is nonnegative, prove it first for $X = c 1_{x \in A}$, then for step functions $X \in \mathcal{S}$ and then by the monotone convergence theorem for any $X \in \mathcal{L}$ for which $h(X) \in L^1$. If $h(X)$ is integrable, then $E[h(X)] = E[h^+(X)] - E[h^-(X)]$. □


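Proposition 2.12.4. The distributions listed above have the following means and variances:

Distribution | Parameters | Mean | Variance
ac1) Normal | $m \in \mathbb{R}, \sigma^2 > 0$ | $m$ | $\sigma^2$
ac2) Cauchy | $m \in \mathbb{R}, b > 0$ | does not exist | $\infty$
ac3) Uniform | $a < b$ | $(a+b)/2$ | $(b-a)^2/12$
ac4) Exponential | $\lambda > 0$ | $1/\lambda$ | $1/\lambda^2$
ac5) Log-Normal | $m \in \mathbb{R}, \sigma^2 > 0$ | $e^{m + \sigma^2/2}$ | $(e^{\sigma^2} - 1) e^{2m + \sigma^2}$
ac6) Beta | $p, q > 0$ | $p/(p+q)$ | $pq/((p+q)^2 (p+q+1))$
ac7) Gamma | $\alpha, \beta > 0$ | $\alpha\beta$ | $\alpha\beta^2$
pp1) Bernoulli (binomial) | $n \in \mathbb{N}, p \in [0,1]$ | $np$ | $np(1-p)$
pp2) Poisson | $\lambda > 0$ | $\lambda$ | $\lambda$
pp3) Uniform | $n \in \mathbb{N}$ | $(1+n)/2$ | $(n^2-1)/12$
pp4) Geometric | $p \in (0,1)$ | $(1-p)/p$ | $(1-p)/p^2$
pp5) First success | $p \in (0,1)$ | $1/p$ | $(1-p)/p^2$
sc1) Cantor | - | $1/2$ | $1/8$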


Proof. These are direct computations, which we do in some of the examples:
Exponential distribution:

$$ E[X^p] = \int_0^\infty x^p \lambda e^{-\lambda x} \, dx = \frac{p}{\lambda} E[X^{p-1}] = \frac{p!}{\lambda^p} . $$

Poisson distribution:

$$ E[X] = \sum_{k=0}^\infty k e^{-\lambda} \frac{\lambda^k}{k!} = \lambda e^{-\lambda} \sum_{k=1}^\infty \frac{\lambda^{k-1}}{(k-1)!} = \lambda . $$

For calculating higher moments, one can also use the probability generating function

$$ E[z^X] = \sum_{k=0}^\infty e^{-\lambda} \frac{(\lambda z)^k}{k!} = e^{-\lambda(1-z)} $$

and then differentiate this identity with respect to $z$ at the place $z = 1$. We get then

$$ E[X] = \lambda, \quad E[X(X-1)] = \lambda^2, \quad E[X(X-1)(X-2)] = \lambda^3, \dots $$

so that $E[X^2] = \lambda + \lambda^2$ and $Var[X] = \lambda$.
Geometric distribution. Differentiating the identity for the geometric series

$$ \sum_{k=0}^\infty x^k = \frac{1}{1-x} $$

gives

$$ \sum_{k=0}^\infty k x^{k-1} = \frac{1}{(1-x)^2} . $$

Therefore

$$ E[X] = \sum_{k=0}^\infty k (1-p)^k p = p(1-p) \sum_{k=0}^\infty k (1-p)^{k-1} = \frac{p(1-p)}{p^2} = \frac{1-p}{p} . $$

For calculating the higher moments one can proceed as in the Poisson case or use the moment generating function.
Cantor distribution: because one can realize a random variable with the Cantor distribution as $X = \sum_{n=1}^\infty X_n/3^n$, where the IID random variables $X_n$ take the values 0 and 2 with probability $p = 1/2$ each, we have

$$ E[X] = \sum_{n=1}^\infty \frac{E[X_n]}{3^n} = \sum_{n=1}^\infty \frac{1}{3^n} = \frac{1}{1 - 1/3} - 1 = \frac{1}{2} $$

and

$$ Var[X] = \sum_{n=1}^\infty \frac{Var[X_n]}{3^{2n}} = \sum_{n=1}^\infty \frac{1}{9^n} = \frac{1}{1 - 1/9} - 1 = \frac{9}{8} - 1 = \frac{1}{8} . $$

See also corollary (3.1.6) for another computation. □


Computations can sometimes be done in an elegant way using characteristic functions $\phi_X(t) = E[e^{itX}]$ or moment generating functions $M_X(t) = E[e^{tX}]$. With the moment generating function one can get the moments with the moment formula

$$ E[X^m] = \int_{\mathbb{R}} x^m \, d\mu = \frac{d^m M_X}{dt^m}(t) \Big|_{t=0} . $$

For the characteristic function one obtains

$$ E[X^m] = \int_{\mathbb{R}} x^m \, d\mu = (-i)^m \frac{d^m \phi_X}{dt^m}(t) \Big|_{t=0} . $$


Example. The random variable $X(x) = x$ has the uniform distribution on $[0,1]$. Its moment generating function is $M_X(t) = \int_0^1 e^{tx} \, dx = (e^t - 1)/t = 1 + t/2! + t^2/3! + \dots$. A comparison of coefficients gives the moments $E[X^m] = 1/(m+1)$, which agrees with the moment formula.


Example. A random variable $X$ which has the normal distribution $N(m, \sigma^2)$ has the moment generating function $M_X(t) = e^{tm + \sigma^2 t^2/2}$. All the moments can be obtained with the moment formula. For example, $E[X] = M_X'(0) = m$, $E[X^2] = M_X''(0) = m^2 + \sigma^2$.

Example. For a Poisson distributed random variable $X$ on $\Omega = \mathbb{N} = \{0, 1, 2, 3, \dots\}$ with $P[X = k] = e^{-\lambda} \frac{\lambda^k}{k!}$, the moment generating function is

$$ M_X(t) = \sum_{k=0}^\infty P[X = k] e^{tk} = e^{\lambda(e^t - 1)} . $$


Example. A random variable $X$ on $\Omega = \mathbb{N} = \{0, 1, 2, 3, \dots\}$ with the geometric distribution $P[X = k] = p(1-p)^k$ has the moment generating function

$$ M_X(t) = \sum_{k=0}^\infty e^{kt} p (1-p)^k = p \sum_{k=0}^\infty ((1-p) e^t)^k = \frac{p}{1 - (1-p) e^t} . $$

A random variable $X$ on $\Omega = \{1, 2, 3, \dots\}$ with the distribution of first success $P[X = k] = p(1-p)^{k-1}$ has the moment generating function

$$ M_X(t) = \sum_{k=1}^\infty e^{kt} p (1-p)^{k-1} = e^t p \sum_{k=0}^\infty ((1-p) e^t)^k = \frac{p e^t}{1 - (1-p) e^t} . $$


Exercise. Compute the mean and variance of the Erlang distribution

$$ f(x) = \frac{\lambda^k x^{k-1}}{(k-1)!} e^{-\lambda x} $$

on the positive real line $\Omega = [0, \infty)$ with the help of the moment generating function. If $k$ is allowed to be an arbitrary positive real number, then the Erlang distribution is called the Gamma distribution.


Lemma 2.12.6. If $X, Y$ are independent random variables, then their moment generating functions satisfy

$$ M_{X+Y}(t) = M_X(t) \cdot M_Y(t) . $$


Proof. If $X$ and $Y$ are independent, then also $e^{tX}$ and $e^{tY}$ are independent. Therefore,

$$ E[e^{t(X+Y)}] = E[e^{tX} e^{tY}] = E[e^{tX}] E[e^{tY}] = M_X(t) \cdot M_Y(t) . $$

□


Example. The lemma can be used to compute the moment generating function of the binomial distribution. A random variable $X$ with binomial distribution can be written as a sum of IID random variables $X_i$ taking values 0 and 1 with probability $1-p$ and $p$. Because for $n = 1$, we have $M_{X_i}(t) = (1-p) + p e^t$, the moment generating function of $X$ is $M_X(t) = [(1-p) + p e^t]^n$. The moment formula allows us to compute moments $E[X^n]$ and central moments $E[(X - E[X])^n]$ of $X$. Examples:

$$ E[X] = np $$
$$ E[X^2] = np(1 - p + np) $$
$$ Var[X] = E[(X - E[X])^2] = E[X^2] - E[X]^2 = np(1-p) $$
$$ E[X^3] = np(1 + 3(n-1)p + (2 - 3n + n^2)p^2) $$
$$ E[X^4] = np(1 + 7(n-1)p + 6(2 - 3n + n^2)p^2 + (-6 + 11n - 6n^2 + n^3)p^3) $$
$$ E[(X - E[X])^4] = E[X^4] - 4E[X]E[X^3] + 6E[X]^2 E[X^2] - 3E[X]^4 $$
$$ = np(1-p)(1 + 3(n-2)p - 3(n-2)p^2) $$
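
Closed forms of this kind are easy to check by exact summation over the binomial probabilities. The sketch below is an illustration, with the hypothetical helper names `binomial_moment` and `binomial_central_moment` and the arbitrary test values $n = 7$, $p = 1/3$; it uses rational arithmetic so that the comparisons are exact. The fourth central moment is compared with the standard form $np(1-p)(1 + 3(n-2)p(1-p))$.

```python
from fractions import Fraction
from math import comb


def binomial_moment(n, p, m):
    """E[X^m] for X binomial(n, p), by exact summation."""
    return sum(Fraction(k) ** m * comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n + 1))


def binomial_central_moment(n, p, m):
    """E[(X - E[X])^m] for X binomial(n, p), by exact summation."""
    mean = n * p
    return sum((k - mean) ** m * comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n + 1))


n, p = 7, Fraction(1, 3)  # assumed test values
e3 = binomial_moment(n, p, 3)
e4 = binomial_moment(n, p, 4)
c4 = binomial_central_moment(n, p, 4)
print(e3 == n * p * (1 + 3 * (n - 1) * p + (2 - 3 * n + n ** 2) * p ** 2))
print(c4 == n * p * (1 - p) * (1 + 3 * (n - 2) * p * (1 - p)))
```

Because the sums are finite and the arithmetic is exact, any mismatch would point to an error in the closed form rather than to rounding.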
Example. The sum $X + Y$ of two independent random variables, a Poisson distributed random variable $X$ with parameter $\lambda$ and a Poisson distributed random variable $Y$ with parameter $\mu$, is Poisson distributed with parameter $\lambda + \mu$, as can be seen by multiplying their moment generating functions.
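
The same fact can be verified directly on the level of probabilities: for independent $X$ and $Y$, the distribution of $X + Y$ is the convolution of the two distributions. The following sketch is only an illustration; the function names and the parameter values 1.5 and 2.5 are assumptions.

```python
import math


def poisson_pmf(lam, k):
    """P[X = k] for a Poisson distributed X with parameter lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def convolved_pmf(lam, mu, k):
    """P[X + Y = k] for independent Poisson X (parameter lam) and Y (parameter mu)."""
    return sum(poisson_pmf(lam, j) * poisson_pmf(mu, k - j) for j in range(k + 1))


lam, mu = 1.5, 2.5  # assumed example parameters
for k in range(5):
    print(k, convolved_pmf(lam, mu, k), poisson_pmf(lam + mu, k))
```

The two columns agree up to rounding, which is the binomial theorem applied to $(\lambda + \mu)^k$.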


Definition. An interesting quantity for a random variable with a continuous distribution with probability density $f_X$ is the Shannon entropy or simply entropy

$$ H(X) = -\int_{\mathbb{R}} f(x) \log(f(x)) \, dx . $$

Without restricting the class of functions, $H(X)$ is allowed to be $-\infty$ or $\infty$. The entropy allows one to distinguish several distributions from others by asking for the distribution with the largest entropy. For example, among all distribution functions on the positive real line $[0, \infty)$ with fixed expectation $m = 1/\lambda$, the exponential distribution $\lambda e^{-\lambda x}$ is the one with maximal entropy. We will return to these interesting entropy extremization questions later.


Example. Let us compute the entropy of the random variable $X_m(x) = x^m$ on $([0,1], \mathcal{B}, dx)$. We have seen earlier that the density of $X_m$ is $f_X(x) = x^{1/m - 1}/m$ so that

$$ H(X_m) = -\int_0^1 (x^{1/m-1}/m) \log(x^{1/m-1}/m) \, dx . $$

To compute this integral, note first that $f(x) = x^a \log(x^a) = a x^a \log(x)$ has the antiderivative $a x^{1+a} ((1+a) \log(x) - 1)/(1+a)^2$ so that $\int_0^1 x^a \log(x^a) \, dx = -a/(1+a)^2$ and $H(X_m) = 1 - m + \log(m)$. Because $\frac{d}{dm} H(X_m) = (1/m) - 1$ and $\frac{d^2}{dm^2} H(X_m) = -1/m^2$, the entropy has its maximum at $m = 1$, where the density is uniform. The entropy decreases for $m \to \infty$. Among all random variables $X_m(x) = x^m$, the random variable $X(x) = x$ has maximal entropy.
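
The closed form $1 - m + \log(m)$ can be compared with a Monte Carlo estimate of $H(X) = -E[\log f_X(X)]$. The sketch below is an illustration only: it samples $X = U^m$ with $U$ uniform on $(0,1]$ and evaluates the logarithm of the density $x^{1/m-1}/m$ directly; function names, seed and sample size are assumptions.

```python
import math
import random


def entropy_closed_form(m):
    """H(X) = 1 - m + log(m) for X(x) = x**m on [0, 1]."""
    return 1.0 - m + math.log(m)


def entropy_monte_carlo(m, trials, rng):
    """Estimate H(X) = -E[log f(X)] with f(x) = x**(1/m - 1) / m."""
    total = 0.0
    for _ in range(trials):
        u = 1.0 - rng.random()          # uniform on (0, 1], avoids log(0)
        x = u ** m                      # sample of X
        total += (1.0 / m - 1.0) * math.log(x) - math.log(m)  # log f(x)
    return -total / trials


rng = random.Random(0)
for m in (1.0, 2.0, 3.0):
    print(m, entropy_closed_form(m), entropy_monte_carlo(m, 200_000, rng))
```

For $m = 1$ the density is constant and both values are zero, which is the maximum of the entropy in this family.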


Figure. The entropy of the random variables $X(x) = x^m$ on $[0,1]$ as a function of $m$. The maximum is attained for $m = 1$, which is the uniform distribution.
2.13 Weak convergence


Definition. Denote by $C_b(\mathbb{R})$ the vector space of bounded continuous functions on $\mathbb{R}$. This means that $\|f\|_\infty = \sup_{x \in \mathbb{R}} |f(x)| < \infty$ for every $f \in C_b(\mathbb{R})$. A sequence of Borel probability measures $\mu_n$ on $\mathbb{R}$ converges weakly to a probability measure $\mu$ on $\mathbb{R}$ if for every $f \in C_b(\mathbb{R})$ one has

$$ \int f \, d\mu_n \to \int f \, d\mu $$

in the limit $n \to \infty$.


Remark. For weak convergence, it is enough to test $\int f \, d\mu_n \to \int f \, d\mu$ for a dense set in $C_b(\mathbb{R})$. This dense set can consist of the space $P(\mathbb{R})$ of polynomials or the space $C_b^\infty(\mathbb{R})$ of bounded, smooth functions.

An important fact is that a sequence of random variables $X_n$ converges in distribution to $X$ if and only if $E[h(X_n)] \to E[h(X)]$ for all smooth functions $h$ on the real line. This will be used in the proof of the central limit theorem.


Weak convergence defines a topology on the set $M_1(\mathbb{R})$ of all Borel probability measures on $\mathbb{R}$. Similarly, one has a topology for $M_1([a,b])$.


Lemma 2.13.1. The set $M_1(I)$ of all probability measures on an interval $I = [a,b]$ is a compact topological space.


Proof. We need to show that any sequence $\mu_n$ of probability measures on $I$ has an accumulation point. The set of functions $f_k(x) = x^k$ on $[a,b]$ span all polynomials and so a dense set in $C_b([a,b])$. The sequence $\mu_n$ converges to $\mu$ if and only if all the moments $\int_a^b x^k \, d\mu_n$ converge for $n \to \infty$ and for all $k \in \mathbb{N}$. In other words, the compactness of $M_1([a,b])$ is equivalent to the compactness of the product space $I^{\mathbb{N}}$ with the product topology, which is Tychonov's theorem. □


Remark. In functional analysis, a more general theorem called Banach-Alaoglu theorem is known: a closed and bounded set in the dual space $X^*$ of a Banach space $X$ is compact with respect to the weak-* topology, where the functionals $\mu_n$ converge to $\mu$ if and only if $\mu_n(f)$ converges to $\mu(f)$ for all $f \in X$. In the present case, $X = C_b[a,b]$ and the dual space $X^*$ is the space of all signed measures on $[a,b]$ (see [7]).


Remark. The compactness of probability measures can also be seen by looking at the distribution functions $F_\mu(s) = \mu((-\infty, s])$. Given a sequence $F_n$ of monotonically increasing functions, there is a subsequence $F_{n_k}$ which converges to another monotonically increasing function $F$, which is again a distribution function. This fact generalizes to distribution functions on the line, where the limiting function $F$ is still a right-continuous and non-decreasing function (Helly's selection theorem), but $F$ need no longer be a distribution function if the interval $[a,b]$ is replaced by the real line $\mathbb{R}$.


Definition. A sequence of random variables $X_n$ converges weakly or in law to a random variable $X$, if the laws $\mu_{X_n}$ of $X_n$ converge weakly to the law $\mu_X$ of $X$.


Definition. Given a distribution function $F$, we denote by Cont($F$) the set of continuity points of $F$.


Remark. Because $F$ is nondecreasing and takes values in $[0,1]$, the only possible discontinuities are jump discontinuities. They happen at points $t_i$, where $a_i = \mu(\{t_i\}) > 0$. There can be only countably many such discontinuities, because for every rational number $p/q > 0$, there are only finitely many $a_i$ with $a_i > p/q$ because $\sum_i a_i \leq 1$.


Definition. We say that a sequence of random variables $X_n$ converges in distribution to a random variable $X$, if $F_{X_n}(x) \to F_X(x)$ pointwise for all $x \in$ Cont($F_X$).


Theorem 2.13.2 (Weak convergence = convergence in distribution). A se- 
quence X,, of random variables converges in law to a random variable X if 
and only if X,, converges in distribution to X. 


Proof. (i) Assume we have convergence in law. We want to show that we have convergence in distribution. Given $s \in$ Cont($F$) and $\delta > 0$. Define a continuous function $1_{(-\infty,s]} \leq f \leq 1_{(-\infty,s+\delta]}$. Then

$$ F_n(s) = \int_{\mathbb{R}} 1_{(-\infty,s]} \, d\mu_n \leq \int_{\mathbb{R}} f \, d\mu_n \leq \int_{\mathbb{R}} 1_{(-\infty,s+\delta]} \, d\mu_n = F_n(s+\delta) . $$

This gives

$$ \limsup_{n \to \infty} F_n(s) \leq \lim_{n \to \infty} \int f \, d\mu_n = \int f \, d\mu \leq F(s+\delta) . $$

Similarly, we obtain with a function $1_{(-\infty,s-\delta]} \leq f \leq 1_{(-\infty,s]}$

$$ \liminf_{n \to \infty} F_n(s) \geq \lim_{n \to \infty} \int f \, d\mu_n = \int f \, d\mu \geq F(s-\delta) . $$

Since $F$ is continuous at $s$ we have for $\delta \to 0$:

$$ F(s) = \lim_{\delta \to 0} F(s-\delta) \leq \liminf_{n \to \infty} F_n(s) \leq \limsup_{n \to \infty} F_n(s) \leq F(s) . $$

That is, we have established convergence in distribution.

(ii) Assume now we have convergence in distribution but no convergence in law. There exists then a continuous function $f$ so that $\int f \, d\mu_n \to \int f \, d\mu$ fails. That is, there is a subsequence and $\epsilon > 0$ such that $|\int f \, d\mu_{n_k} - \int f \, d\mu| \geq \epsilon > 0$. There exists a compact interval $I$ such that $|\int_I f \, d\mu_{n_k} - \int_I f \, d\mu| \geq \epsilon/2 > 0$ and we can assume that $\mu_{n_k}$ and $\mu$ have support on $I$. The set of all probability measures on $I$ is compact in the weak topology. Therefore, a subsequence of $\mu_{n_k}$ converges weakly to a measure $\nu$ and $|\nu(f) - \mu(f)| \geq \epsilon/2$. Define the $\pi$-system $\mathcal{I}$ of all intervals $\{(-\infty, s] \mid s \text{ continuity point of } F\}$. We have $\mu_n((-\infty, s]) = F_{X_n}(s) \to F_X(s) = \mu((-\infty, s])$. Using (i) we see $\mu_{n_k}((-\infty, s]) \to \nu((-\infty, s])$ also, so that $\mu$ and $\nu$ agree on the $\pi$-system $\mathcal{I}$. If $\mu$ and $\nu$ agree on $\mathcal{I}$, they agree on the $\pi$-system of all intervals $\{(-\infty, s]\}$. By lemma (2.1.4), we know that $\mu = \nu$ on the Borel $\sigma$-algebra and so $\mu = \nu$. This contradicts $|\nu(f) - \mu(f)| \geq \epsilon/2$. So, the initial assumption of having no convergence in law was wrong. □


2.14 The central limit theorem 


Definition. For any random variable $X$ with non-zero variance, we denote by

$$ X^* = \frac{X - E[X]}{\sigma(X)} $$

the normalized random variable, which has mean $E[X^*] = 0$ and variance $\sigma(X^*) = \sqrt{Var[X^*]} = 1$. Given a sequence of random variables $X_k$, we again use the notation $S_n = \sum_{k=1}^n X_k$.


Theorem 2.14.1 (Central limit theorem for independent $L^3$ random variables). Assume $X_i \in L^3$ are independent and satisfy

$$ M = \sup_i \|X_i\|_3 < \infty, \quad \delta = \liminf_{n \to \infty} \frac{1}{n} \sum_{i=1}^n Var[X_i] > 0 . $$

Then $S_n^*$ converges in distribution to a random variable with standard normal distribution $N(0,1)$:

$$ \lim_{n \to \infty} P[S_n^* \leq x] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^x e^{-y^2/2} \, dy, \quad \forall x \in \mathbb{R} . $$
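
The statement can be explored numerically. The following sketch is an illustration under assumptions of its own: the $X_i$ are IID uniform on $[-1,1]$ (mean 0, variance 1/3), and the empirical distribution function of $S_n^*$ is compared with the standard normal distribution function, computed with `math.erf`. The function names, seed and sample sizes are hypothetical.

```python
import math
import random


def normalized_sum(n, rng):
    """One sample of S_n^* for X_i IID uniform on [-1, 1] (mean 0, variance 1/3)."""
    s = sum(rng.uniform(-1.0, 1.0) for _ in range(n))
    return s / math.sqrt(n / 3.0)


def empirical_cdf(n, x, trials, rng):
    """Monte Carlo estimate of P[S_n^* <= x]."""
    return sum(normalized_sum(n, rng) <= x for _ in range(trials)) / trials


def normal_cdf(x):
    """Distribution function of N(0, 1)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


rng = random.Random(0)
for x in (-1.0, 0.0, 1.0):
    print(x, empirical_cdf(30, x, 20_000, rng), normal_cdf(x))
```

Already for moderate $n$ the two columns are close; the remaining difference is a mixture of the finite $n$ effect and Monte Carlo noise.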
Figures. The probability density functions $f_{S_n^*}$ of the normalized sums $S_n^*$ for the random variable $X(x) = x$ on $[-1,1]$, shown in three panels.


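Lemma 2.14.2. A $N(0,\sigma^2)$ distributed random variable $X$ satisfies

$$ E[|X|^p] = \frac{1}{\sqrt{\pi}} 2^{p/2} \sigma^p \, \Gamma(\frac{1}{2}(p+1)) . $$

Especially $E[|X|^3] = \sqrt{8/\pi} \, \sigma^3$.


Proof. With the density function $f(x) = (2\pi\sigma^2)^{-1/2} e^{-x^2/(2\sigma^2)}$, we have
$E[|X|^p] = 2 \int_0^\infty x^p f(x) \, dx$ which is after a substitution $z = x^2/(2\sigma^2)$ equal to

$$ \frac{1}{\sqrt{\pi}} 2^{p/2} \sigma^p \int_0^\infty z^{\frac{1}{2}(p+1) - 1} e^{-z} \, dz . $$

The integral to the right is by definition equal to $\Gamma(\frac{1}{2}(p+1))$. □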
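As a sanity check of the lemma, the closed formula can be compared with a direct numerical integration of $2 \int_0^\infty x^p f(x) \, dx$. This is a sketch only; the midpoint rule, the cutoff at $12\sigma$ and the function names are assumptions made for illustration.

```python
import math

def abs_moment_formula(p, sigma):
    """E[|X|^p] for X ~ N(0, sigma^2) according to the Gamma function formula."""
    return 2 ** (p / 2) * sigma ** p * math.gamma((p + 1) / 2) / math.sqrt(math.pi)

def abs_moment_numeric(p, sigma, steps=200000):
    """Midpoint rule for 2 * integral_0^(12 sigma) x^p f(x) dx with the normal density f."""
    upper = 12.0 * sigma
    h = upper / steps
    total = 0.0
    for j in range(steps):
        x = (j + 0.5) * h
        total += x ** p * math.exp(-x * x / (2 * sigma * sigma))
    return 2.0 * total * h / math.sqrt(2 * math.pi * sigma * sigma)

third = abs_moment_formula(3, 1.0)  # equals sqrt(8/pi) for sigma = 1
```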


After this preliminary computation, we turn to the proof of the central 
limit theorem. 


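Proof. Define for fixed $n \geq 1$ the random variables

$$ Y_i = \frac{X_i - E[X_i]}{\sigma(S_n)}, \quad 1 \leq i \leq n $$

so that $S_n^* = \sum_{i=1}^n Y_i$. Define $N(0,\sigma_i^2)$-distributed random variables $\tilde{Y}_i$
with $\sigma_i^2 = {\rm Var}[Y_i]$, having the property that the set of random variables

$$ \{ Y_1, \dots, Y_n, \tilde{Y}_1, \dots, \tilde{Y}_n \} $$

are independent. The distribution of $\tilde{S}_n = \sum_{i=1}^n \tilde{Y}_i$ is just the normal
distribution $N(0,1)$. In order to show the theorem, we have to prove
$E[f(S_n^*)] - E[f(\tilde{S}_n)] \to 0$ for any $f \in C_b(\mathbb{R})$. It is enough to verify it for smooth $f$.
Define

$$ Z_k = \tilde{Y}_1 + \dots + \tilde{Y}_{k-1} + Y_{k+1} + \dots + Y_n . $$

Note that $Z_1 + Y_1 = S_n^*$ and $Z_n + \tilde{Y}_n = \tilde{S}_n$. Using first a telescopic sum
and then Taylor's theorem, we can write

$$ f(S_n^*) - f(\tilde{S}_n) = \sum_{k=1}^n [f(Z_k + Y_k) - f(Z_k + \tilde{Y}_k)] $$
$$ = \sum_{k=1}^n f'(Z_k)(Y_k - \tilde{Y}_k) + \sum_{k=1}^n \frac{1}{2} f''(Z_k)(Y_k^2 - \tilde{Y}_k^2) + \sum_{k=1}^n [R(Z_k, Y_k) - R(Z_k, \tilde{Y}_k)] $$

with a Taylor rest term $R(Z,Y)$, which can depend on $f$. Because $Z_k$ is
independent of $Y_k$ and $\tilde{Y}_k$, which have the same mean and the same variance,
the expectations of the first two sums vanish. We get therefore

$$ |E[f(S_n^*)] - E[f(\tilde{S}_n)]| \leq \sum_{k=1}^n E[|R(Z_k, Y_k)|] + E[|R(Z_k, \tilde{Y}_k)|] . \quad (2.7) $$

Because $\tilde{Y}_k$ are $N(0,\sigma_k^2)$-distributed, we get by lemma (2.14.2) and the
Jensen inequality (2.5.1)

$$ E[|\tilde{Y}_k|^3] = \sqrt{8/\pi} \, \sigma_k^3 = \sqrt{8/\pi} \, E[|Y_k|^2]^{3/2} \leq \sqrt{8/\pi} \, E[|Y_k|^3] . $$

Taylor's theorem gives $|R(Z_k, Y_k)| \leq {\rm const} \cdot |Y_k|^3$ so that

$$ \sum_{k=1}^n E[|R(Z_k, Y_k)|] + E[|R(Z_k, \tilde{Y}_k)|] \leq {\rm const} \cdot \sum_{k=1}^n E[|Y_k|^3] $$
$$ \leq {\rm const} \cdot n \cdot \sup_i ||X_i||_3^3 / {\rm Var}[S_n]^{3/2} $$
$$ = {\rm const} \cdot \frac{\sup_i ||X_i||_3^3}{({\rm Var}[S_n]/n)^{3/2}} \cdot \frac{1}{\sqrt{n}} $$
$$ \leq {\rm const} \cdot \frac{M^3}{\delta^{3/2}} \cdot \frac{1}{\sqrt{n}} = \frac{C(f)}{\sqrt{n}} \to 0 . $$

We have seen that for every smooth $f \in C_b(\mathbb{R})$ there exists a constant $C(f)$
such that $|E[f(S_n^*)] - E[f(\tilde{S}_n)]| \leq C(f)/\sqrt{n}$. □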


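If we assume the $X_i$ to be identically distributed, we can relax the condition
$X_i \in L^3$ to $X_i \in L^2$:


Theorem 2.14.3 (Central limit theorem for IID $L^2$ random variables). If
$X_i \in L^2$ are IID and satisfy $0 < {\rm Var}[X_i]$, then $S_n^*$ converges weakly to a
random variable with standard normal distribution $N(0,1)$.

Proof. The same proof gives equation (2.7). We change the estimate of the
Taylor rest term to $|R(z,y)| \leq \delta(y) \cdot y^2$ with a bounded function $\delta(y) \to 0$ for $|y| \to 0$.
We can assume $E[X_i] = 0$ and write $\sigma^2 = {\rm Var}[X_1]$, so that $Y_k = X_k/(\sigma \sqrt{n})$
and $\tilde{Y}_k$ has the law of $\tilde{X}_1/(\sigma \sqrt{n})$ for a $N(0,\sigma^2)$-distributed $\tilde{X}_1$. Using the IID
property and using dominated convergence we can estimate the rest term

$$ R = \sum_{k=1}^n E[|R(Z_k, Y_k)|] + E[|R(Z_k, \tilde{Y}_k)|] $$

as follows:

$$ R \leq \sum_{k=1}^n E[\delta(Y_k) Y_k^2] + E[\delta(\tilde{Y}_k) \tilde{Y}_k^2] $$
$$ = n \cdot E[\delta(\frac{X_1}{\sigma \sqrt{n}}) \frac{X_1^2}{\sigma^2 n}] + n \cdot E[\delta(\frac{\tilde{X}_1}{\sigma \sqrt{n}}) \frac{\tilde{X}_1^2}{\sigma^2 n}] $$
$$ = E[\delta(\frac{X_1}{\sigma \sqrt{n}}) \frac{X_1^2}{\sigma^2}] + E[\delta(\frac{\tilde{X}_1}{\sigma \sqrt{n}}) \frac{\tilde{X}_1^2}{\sigma^2}] \to 0 . $$

□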


The central limit theorem can be interpreted as a solution to a fixed point 
problem: 


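Definition. Let $P_{0,1}$ be the space of probability measures $\mu$ on $(\mathbb{R}, \mathcal{B}_{\mathbb{R}})$ which
have the properties that $\int_{\mathbb{R}} x^2 \, d\mu(x) = 1$, $\int_{\mathbb{R}} x \, d\mu(x) = 0$. Define the map

$$ T\mu(A) = \int_{\mathbb{R}} \int_{\mathbb{R}} 1_A(\frac{x+y}{\sqrt{2}}) \, \mu(dx) \, \mu(dy) $$

on $P_{0,1}$.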


Corollary 2.14.4. The only attracting fixed point of $T$ on $P_{0,1}$ is the law of
the standard normal distribution.


Proof. Let $\mu$ be the law of a random variable $X$ with ${\rm Var}[X] = 1$ and $E[X] = 0$.
Then $T(\mu)$ is the law of the normalized random variable $(X + Y)/\sqrt{2}$, where
$Y$ is an independent copy of $X$, because the independent random variables
$X, Y$ can be realized on the probability space $(\mathbb{R}^2, \mathcal{B}, \mu \times \mu)$ as coordinate
functions $X((x,y)) = x$, $Y((x,y)) = y$. Iterating, $T^n(\mu)$ is the law of
$(S_{2^n})^*$, which converges in distribution to $N(0,1)$. □

For independent 0-1 experiments with win probability $p \in (0,1)$, the
central limit theorem is quite old. In this case

$$ \lim_{n \to \infty} P[\frac{S_n - np}{\sqrt{np(1-p)}} \leq x] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^x e^{-y^2/2} \, dy $$

as had been shown by de Moivre in 1730 in the case $p = 1/2$ and for general
$p \in (0,1)$ by Laplace in 1812. It is a direct consequence of the central limit
theorem:


Corollary 2.14.5. (DeMoivre-Laplace limit theorem) The distribution of $X_n^*$
converges to the normal distribution if $X_n$ has the binomial distribution
$B(n,p)$.


For more general versions of the central limit theorem, see [105]. The next
limit theorem for discrete random variables illustrates why the Poisson
distribution on $\mathbb{N}$ is natural. Denote by $B(n,p)$ the binomial distribution on
$\{0, \dots, n\}$ and with $P_\alpha$ the Poisson distribution on $\mathbb{N} = \{0, 1, 2, \dots \}$.


Theorem 2.14.6 (Poisson limit theorem). Let $X_n$ be $B(n, p_n)$-distributed
and suppose $n p_n \to \alpha$. Then $X_n$ converges in distribution to a random
variable $X$ with Poisson distribution with parameter $\alpha$.


Proof. We have to show that $P[X_n = k] \to P[X = k]$ for each fixed $k \in \mathbb{N}$.

$$ P[X_n = k] = {n \choose k} p_n^k (1 - p_n)^{n-k} $$
$$ = \frac{n(n-1)(n-2) \cdots (n-k+1)}{k!} p_n^k (1 - p_n)^{n-k} $$
$$ \sim \frac{1}{k!} (n p_n)^k (1 - \frac{n p_n}{n})^{n-k} \to \frac{\alpha^k}{k!} e^{-\alpha} . $$

□

Figure. The binomial distribution $B(2, 1/2)$ has its support on $\{0, 1, 2\}$.

Figure. The binomial distribution $B(5, 1/5)$ has its support on $\{0, 1, 2, 3, 4, 5\}$.

Figure. The Poisson distribution with $\alpha = 1$ on $\mathbb{N} = \{0, 1, 2, 3, \dots \}$.
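The Poisson limit theorem can be illustrated numerically by comparing the weights of $B(n, \alpha/n)$ with the Poisson weights $\alpha^k e^{-\alpha}/k!$. The sketch below assumes $p_n = \alpha/n$ exactly; the helper names are hypothetical.

```python
import math

def binomial_pmf(n, p, k):
    """P[X_n = k] for the binomial distribution B(n, p)."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def poisson_pmf(alpha, k):
    """P[X = k] for the Poisson distribution with parameter alpha."""
    return math.exp(-alpha) * alpha ** k / math.factorial(k)

def max_gap(n, alpha, kmax=10):
    """Largest difference between the two weights for k = 0, ..., kmax."""
    p = alpha / n
    return max(abs(binomial_pmf(n, p, k) - poisson_pmf(alpha, k))
               for k in range(kmax + 1))

gaps = [max_gap(n, 1.0) for n in (10, 100, 1000, 10000)]  # shrinks as n grows
```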


Exercise. It is custom to use the notation

$$ \Phi(s) = F_X(s) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^s e^{-y^2/2} \, dy $$

for the distribution function of a random variable $X$ which has the standard
normal distribution $N(0,1)$. Given a sequence of IID random variables $X_n$
with this distribution.
a) Justify that one can estimate for large $n$ probabilities

$$ P[a \leq S_n^* \leq b] \sim \Phi(b) - \Phi(a) . $$

b) Assume $X_i$ are all uniformly distributed random variables in $[0,1]$.
Estimate for large $n$

$$ P[|S_n/n - 0.5| \geq \epsilon] $$

in terms of $\Phi$, $\epsilon$ and $n$.
c) Compare the result in b) with the estimate obtained in the weak law of
large numbers.


Exercise. Define for $\lambda > 0$ the transformation

$$ T_\lambda(\mu)(A) = \int_{\mathbb{R}} \int_{\mathbb{R}} 1_A(\frac{x+y}{\lambda}) \, d\mu(x) \, d\mu(y) $$

in $P = M_1(\mathbb{R})$, the set of all Borel probability measures on $\mathbb{R}$. For which $\lambda$
can you describe the limit?


2.15 Entropy of distributions


Denote by $\nu$ a (not necessarily finite) measure on a measure space $(\Omega, \mathcal{A})$.
An example is the Lebesgue measure on $\mathbb{R}$ or the counting measure on $\mathbb{N}$.
Note that the measure is defined only on a $\delta$-subring of $\mathcal{A}$ since we did not
assume that $\nu$ is finite.


Definition. A probability measure $\mu$ on $\mathbb{R}$ is called $\nu$ absolutely continuous,
if there exists a density $f \in L^1(\nu)$ such that $\mu = f \nu$. If $\mu$ is $\nu$-absolutely
continuous, one writes $\mu \ll \nu$. Call $P(\nu)$ the set of all $\nu$ absolutely
continuous measures. The set $P(\nu)$ can be identified with the set of functions
$f \in L^1(\nu)$ satisfying $f \geq 0$ and $\int f(x) \, d\nu(x) = 1$.


Remark. The fact that $\mu \ll \nu$ defined earlier is equivalent to this is called
the Radon-Nikodym theorem ([?]). The function $f$ is therefore called the
Radon-Nikodym derivative of $\mu$ with respect to $\nu$.


Example. If $\nu$ is the counting measure on $\mathbb{N} = \{0, 1, 2, \dots \}$ and $\mu$ is the
law of the geometric distribution with parameter $p$, then the density is

$$ f(k) = p(1-p)^k . $$


Example. If $\nu$ is the Lebesgue measure on $(-\infty, \infty)$ and $\mu$ is the law of
the standard normal distribution, then the density is $f(x) = e^{-x^2/2}/\sqrt{2\pi}$.
There is a multi-variable calculus trick using polar coordinates, which
immediately shows that $f$ is a density:

$$ \int \int_{\mathbb{R}^2} e^{-(x^2+y^2)/2} \, dx \, dy = \int_0^\infty \int_0^{2\pi} e^{-r^2/2} \, r \, d\theta \, dr = 2\pi . $$


Definition. For any probability measure $\mu \in P(\nu)$ define the entropy

$$ H(\mu) = \int_\Omega -f(\omega) \log(f(\omega)) \, d\nu(\omega) . $$

It generalizes the earlier defined Shannon entropy, where the assumption
had been $d\nu = dx$.


Example. Let $\nu$ be the counting measure on a countable set $\Omega$, where $\mathcal{A}$
is the $\sigma$-algebra of all subsets of $\Omega$ and let the measure $\nu$ be defined on the
$\delta$-ring of all finite subsets of $\Omega$. In this case,

$$ H(\mu) = \sum_{\omega \in \Omega} -f(\omega) \log(f(\omega)) . $$

For example, for $\Omega = \mathbb{N} = \{0, 1, 2, 3, \dots \}$ with counting measure $\nu$, the
geometric distribution $P[\{k\}] = p(1-p)^k$ has the entropy

$$ \sum_{k=0}^\infty -(1-p)^k p \log((1-p)^k p) = \log(\frac{1}{p}) - \frac{1-p}{p} \log(1-p) . $$

Example. Let $\nu$ be the Lebesgue measure on $\mathbb{R}$. If $\mu = f \, dx$ has a density
function $f$, we have

$$ H(\mu) = \int_{\mathbb{R}} -f(x) \log(f(x)) \, dx . $$

For example, for the standard normal distribution $\mu$ with probability
density function $f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$, the entropy is $H(f) = (1 + \log(2\pi))/2$.
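Both entropy values in these examples can be reproduced numerically. The sketch below sums the series for the geometric distribution and integrates $-f \log f$ for the standard normal density with a midpoint rule; the truncation parameters and function names are illustrative assumptions.

```python
import math

def geometric_entropy_sum(p, terms=5000):
    """Sum of -f(k) log f(k) for f(k) = p (1-p)^k, truncated after `terms` terms."""
    total = 0.0
    for k in range(terms):
        f = p * (1.0 - p) ** k
        if f > 0.0:  # skip weights that underflow to zero
            total -= f * math.log(f)
    return total

def geometric_entropy_closed(p):
    """Closed form log(1/p) - ((1-p)/p) log(1-p)."""
    return math.log(1.0 / p) - (1.0 - p) / p * math.log(1.0 - p)

def normal_entropy_numeric(steps=200000, cutoff=12.0):
    """Midpoint rule for the integral of -f log f over [-cutoff, cutoff]."""
    h = 2.0 * cutoff / steps
    total = 0.0
    for j in range(steps):
        x = -cutoff + (j + 0.5) * h
        f = math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)
        total -= f * math.log(f) * h
    return total
```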


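Example. Let $\nu$ be the Lebesgue measure $dx$ on $\Omega = \mathbb{R}^+ = (0, \infty)$. A random
variable on $\Omega$ with probability density function $f(x) = \lambda e^{-\lambda x}$ is said to have the
exponential distribution. It has the mean $1/\lambda$. The entropy of this
distribution is $-\int_0^\infty f(x) (\log(\lambda) - \lambda x) \, dx = 1 - \log(\lambda)$.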


Example. Let $\nu$ be a probability measure on $\mathbb{R}$, $f$ a density and let

$$ \mathcal{A} = \{A_1, \dots, A_n \} $$

be a partition on $\mathbb{R}$. For the step function

$$ \tilde{f} = \sum_{i=1}^n [\int_{A_i} f \, d\nu] \, 1_{A_i} \in S(\nu), $$

the entropy $H(\tilde{f} \nu)$ is equal to

$$ H(\{A_i\}) = \sum_i -\nu(A_i) \log(\nu(A_i)) $$

which is called the entropy of the partition $\{A_i\}$. The approximation of the
density $f$ by a step function $\tilde{f}$ is called coarse graining and the entropy
of $\tilde{f}$ is called the coarse grained entropy. It has first been considered by
Gibbs in 1902.


Remark. In ergodic theory, where one studies measure preserving
transformations $T$ of probability spaces, one is interested in the growth rate of
the entropy of a partition generated by $A, T(A), \dots, T^n(A)$. This leads to
the notion of an entropy of a measure preserving transformation called
Kolmogorov-Sinai entropy.


Interpretation. Assume that $\Omega$ is finite, that $\nu$ is the counting measure
and that $\mu(\{\omega\}) = f(\omega)$ is the probability distribution of a random variable
describing the measurement of an experiment. If the event $\{\omega\}$ happens, then
$-\log(f(\omega))$ is a measure for the information or "surprise" that the event
happens. The averaged information or surprise is

$$ H(\mu) = \sum_\omega -f(\omega) \log(f(\omega)) . $$

If $f$ takes only the values 0 or 1, which means that $\mu$ is deterministic,
then $H(\mu) = 0$. There is no surprise and what the measurements show is
the reality. On the other hand, if $f$ is the uniform distribution on $\Omega$, then
$H(\mu) = \log(|\Omega|)$. We will see in a moment that this is the maximal entropy.


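Definition. Given two probability measures $\mu = f \nu$ and $\tilde{\mu} = \tilde{f} \nu$ which are
both absolutely continuous with respect to $\nu$. Define the relative entropy

$$ H(\tilde{\mu}|\mu) = \int_\Omega \tilde{f}(\omega) \log(\frac{\tilde{f}(\omega)}{f(\omega)}) \, d\nu(\omega) \in [0, \infty] . $$

It is the expectation $E_{\tilde{\mu}}[l]$ of the likelihood coefficient $l = \log(\frac{\tilde{f}(x)}{f(x)})$. The
negative relative entropy $-H(\tilde{\mu}|\mu)$ is also called the conditional entropy.
One writes also $H(\tilde{f}|f)$ instead of $H(\tilde{\mu}|\mu)$.


Theorem 2.15.1 (Gibbs inequality). $0 \leq H(\tilde{\mu}|\mu) \leq +\infty$ and $H(\tilde{\mu}|\mu) = 0$ if
and only if $\mu = \tilde{\mu}$.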


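Proof. We can assume $H(\tilde{\mu}|\mu) < \infty$. The function $u(x) = x \log(x)$ is convex
on $\mathbb{R}^+ = (0, \infty)$ and satisfies $u(x) \geq x - 1$.

$$ H(\tilde{\mu}|\mu) = \int_\Omega f(\omega) \, u(\frac{\tilde{f}(\omega)}{f(\omega)}) \, d\nu \geq \int_\Omega f(\omega) (\frac{\tilde{f}(\omega)}{f(\omega)} - 1) \, d\nu = 0 . $$

If $\mu = \tilde{\mu}$, then $f = \tilde{f}$ almost everywhere and $H(\tilde{\mu}|\mu) = 0$.
On the other hand, if $H(\tilde{\mu}|\mu) = 0$, then by the Jensen inequality (2.5.1)

$$ 0 = E_\mu[u(\frac{\tilde{f}}{f})] \geq u(E_\mu[\frac{\tilde{f}}{f}]) = u(1) = 0 . $$

Therefore, $E_\mu[u(\frac{\tilde{f}}{f})] = u(E_\mu[\frac{\tilde{f}}{f}])$. The strict convexity of $u$ implies that $\frac{\tilde{f}}{f}$
must be a constant almost everywhere. Since both $f$ and $\tilde{f}$ are densities, we have $f = \tilde{f}$. □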
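On a finite set with counting measure, the relative entropy is a finite sum and the Gibbs inequality can be observed directly. The following is a minimal sketch; the function name and the example densities are hypothetical.

```python
import math

def relative_entropy(f_tilde, f):
    """H(mu~|mu) = sum of f~ log(f~/f) for densities with respect to counting measure."""
    total = 0.0
    for ft, fv in zip(f_tilde, f):
        if ft > 0.0:
            if fv == 0.0:
                return math.inf  # mu~ charges a point which mu does not charge
            total += ft * math.log(ft / fv)
    return total

same = relative_entropy([0.5, 0.5], [0.5, 0.5])       # equal measures give zero
different = relative_entropy([0.9, 0.1], [0.5, 0.5])  # strictly positive
```

The two arguments do not play symmetric roles: exchanging them in general changes the value.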


Remark. The relative entropy can be used to measure the distance between
two distributions. It is however not a metric, since it is not symmetric.
The relative entropy is also known under the name Kullback-Leibler divergence
or Kullback-Leibler metric, if $\nu = dx$ [85].

Theorem 2.15.2 (Distributions with maximal entropy). The following
distributions have maximal entropy.

a) $\Omega$ is finite with counting measure $\nu$. The uniform distribution on $\Omega$
has maximal entropy among all distributions on $\Omega$. It is unique with this
property.

b) $\Omega = \mathbb{N}$ with counting measure $\nu$. The geometric distribution
$P[\{k\}] = p(1-p)^k$ with parameter $p = (1+c)^{-1}$ has maximal entropy among all
distributions on $\mathbb{N} = \{0, 1, 2, 3, \dots \}$ with fixed mean $c = (1-p)/p$. It is unique
with this property.

c) $\Omega = \{0, 1\}^N$ with counting measure $\nu$. The product distribution $\eta^N$,
where $\eta(1) = p$, $\eta(0) = 1 - p$ with $p = c/N$ has maximal entropy among all
distributions satisfying $E[S_N] = c$, where $S_N(\omega) = \sum_{i=1}^N \omega_i$. It is unique
with this property.

d) $\Omega = (0, \infty)$ with Lebesgue measure $\nu$. The exponential distribution with
density $f(x) = \lambda e^{-\lambda x}$ with parameter $\lambda$ on $\Omega$ has the maximal entropy
among all distributions with fixed mean $c = 1/\lambda$. It is unique with this
property.

e) $\Omega = \mathbb{R}$ with Lebesgue measure $\nu$. The normal distribution $N(m, \sigma^2)$
has maximal entropy among all distributions with fixed mean $m$ and fixed
variance $\sigma^2$. It is unique with this property.

f) Finite measures. Let $(\Omega, \mathcal{A})$ be an arbitrary measure space for which
$0 < \nu(\Omega) < \infty$. Then the measure $\mu = f \nu$ with uniform density $f = 1/\nu(\Omega)$
has maximal entropy among all other probability measures on $\Omega$ which are
absolutely continuous with respect to $\nu$. It is unique with this property.


Proof. Let $\mu = f \nu$ be the measure of the distribution for which we want
to prove maximal entropy and let $\tilde{\mu} = \tilde{f} \nu$ be any other measure. The aim
is to show $H(\tilde{\mu}|\mu) = H(\mu) - H(\tilde{\mu})$ which implies the maximality since by
the Gibbs inequality (2.15.1) $H(\tilde{\mu}|\mu) \geq 0$.

In general,

$$ H(\tilde{\mu}|\mu) = -H(\tilde{\mu}) - \int_\Omega \tilde{f}(\omega) \log(f(\omega)) \, d\nu $$

so that in each case, we have to show

$$ H(\mu) = -\int_\Omega \tilde{f}(\omega) \log(f(\omega)) \, d\nu . \quad (2.8) $$

With

$$ H(\tilde{\mu}|\mu) = H(\mu) - H(\tilde{\mu}) $$

we also have uniqueness: if two measures $\tilde{\mu}, \mu$ have maximal entropy, then
$H(\tilde{\mu}|\mu) = 0$ so that by the Gibbs inequality (2.15.1) $\mu = \tilde{\mu}$.


a) The density $f = 1/|\Omega|$ is constant. Therefore $H(\mu) = \log(|\Omega|)$ and
equation (2.8) holds.

b) The geometric distribution on $\mathbb{N} = \{0, 1, 2, \dots \}$ satisfies $P[\{k\}] = f(k) =
p(1-p)^k$, so that

$$ -\int_\Omega \tilde{f}(k) \log(f(k)) \, d\nu = -\log(p) - \log(1-p) \int_\Omega k \tilde{f}(k) \, d\nu(k) $$
$$ = \log(\frac{1}{p}) - \frac{1-p}{p} \log(1-p) , $$

where we used that the mean of $\tilde{f}$ is $c = (1-p)/p$. This is also the entropy of $\mu$.

c) The discrete density is $f(\omega) = p^{S_N}(1-p)^{N - S_N}$ so that

$$ \log(f(\omega)) = S_N \log(p) + (N - S_N) \log(1-p) $$

and

$$ \sum_\omega \tilde{f}(\omega) \log(f(\omega)) = E[S_N] \log(p) + (N - E[S_N]) \log(1-p) . $$

The claim follows since we fixed $E[S_N]$.


d) The density is $f(x) = \lambda e^{-\lambda x}$, so that $\log(f(x)) = \log(\lambda) - \lambda x$. The
claim follows since $E[X] = \int x \, d\tilde{\mu}(x)$ was assumed to be fixed for
all distributions.


e) For the normal distribution $\log(f(x)) = a + b(x - m)^2$ with two real
numbers $a, b$ depending only on $m$ and $\sigma$. The claim follows since we fixed
${\rm Var}[X] = E[(x - m)^2]$ for all distributions.


f) The density $f = 1/\nu(\Omega)$ is constant. Therefore $H(\mu) = \log(\nu(\Omega))$ which is
also the right hand side of equation (2.8) for every density $\tilde{f}$. □
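Part b) can be illustrated numerically. With the convention $P[\{k\}] = p(1-p)^k$ on $\{0, 1, 2, \dots \}$, the mean is $(1-p)/p$, so that mean $c$ corresponds to $p = 1/(1+c)$. The sketch below compares the geometric distribution of mean $c = 2$ with the Poisson distribution of the same mean and checks the identity $H(\tilde{\mu}|\mu) = H(\mu) - H(\tilde{\mu})$; the truncation to 400 terms and all names are illustrative assumptions.

```python
import math

def entropy(f):
    """H = -sum f log f, skipping zero weights."""
    return -sum(v * math.log(v) for v in f if v > 0.0)

def relative_entropy(f_tilde, f):
    """H(mu~|mu) = sum f~ log(f~/f); assumes f > 0 wherever f~ > 0."""
    return sum(a * math.log(a / b) for a, b in zip(f_tilde, f) if a > 0.0)

def geometric(c, terms):
    """Weights p (1-p)^k of the geometric distribution with mean c on {0, 1, 2, ...}."""
    p = 1.0 / (1.0 + c)
    return [p * (1.0 - p) ** k for k in range(terms)]

def poisson(c, terms):
    """Weights e^{-c} c^k / k! computed iteratively to avoid large factorials."""
    weights, w = [], math.exp(-c)
    for k in range(terms):
        weights.append(w)
        w *= c / (k + 1)
    return weights

c, terms = 2.0, 400
f, f_tilde = geometric(c, terms), poisson(c, terms)
gap = entropy(f) - entropy(f_tilde)  # positive: the geometric law has larger entropy
```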


Remark. This result has relations to the foundations of thermodynamics,
where one considers the phase space of $N$ particles moving in a finite region
in Euclidean space. The energy surface is then a compact surface $\Omega$ and the
motion on this surface leaves a measure $\nu$ invariant which is induced from
the flow invariant Lebesgue measure. The measure $\nu$ is called the
microcanonical ensemble. According to f) in the above, it is the measure which
maximizes entropy.


Remark. Let us try to get the maximal distribution using calculus of
variations. In order to find the maximum of the functional

$$ H(f) = -\int_\Omega f \log(f) \, d\nu $$

on $L^1(\nu)$ under the constraints

$$ F(f) = \int_\Omega f \, d\nu = 1, \quad G(f) = \int_\Omega x f \, d\nu = c, $$

we have to find the critical points of $\tilde{H} = H - \lambda F - \mu G$. In infinite
dimensions, constrained critical points are points, where the Lagrange equations

$$ \frac{\partial}{\partial f} H(f) = \lambda \frac{\partial}{\partial f} F(f) + \mu \frac{\partial}{\partial f} G(f) $$
$$ F(f) = 1 $$
$$ G(f) = c $$

are satisfied. The derivative $\partial/\partial f$ is the functional derivative and $\lambda, \mu$ are
the Lagrange multipliers. We find $(f, \lambda, \mu)$ as a solution of the system of
equations

$$ -1 - \log(f(x)) = \lambda + \mu x, $$
$$ \int_\Omega f(x) \, d\nu(x) = 1, $$
$$ \int_\Omega x f(x) \, d\nu(x) = c $$

by solving the first equation for $f$:

$$ f = e^{-\lambda - \mu x - 1} $$
$$ \int_\Omega e^{-\lambda - \mu x - 1} \, d\nu(x) = 1 $$
$$ \int_\Omega x e^{-\lambda - \mu x - 1} \, d\nu(x) = c $$

dividing the third equation by the second, so that we can get $\mu$ from the
equation $\int x e^{-\mu x} \, d\nu(x) = c \int e^{-\mu x} \, d\nu(x)$ and $\lambda$ from the second equation
$e^{1 + \lambda} = \int e^{-\mu x} \, d\nu(x)$. This variational approach produces critical points of
the entropy. Because the Hessian $D^2(H) = -1/f$ is negative definite, it is
also negative definite when restricted to the surface in $L^1$ determined by
the restrictions $F = 1, G = c$. This indicates that we have found a global
maximum.


Example. For $\Omega = \mathbb{R}$, $X(x) = x^2$, we get the normal distribution $N(0,1)$.


Example. For $\Omega = \mathbb{N}$, $X(n) = \epsilon_n$, we get $f(n) = e^{-\epsilon_n \lambda_1}/Z(f)$ with $Z(f) =
\sum_n e^{-\epsilon_n \lambda_1}$ and where $\lambda_1$ is determined by $\sum_n \epsilon_n e^{-\epsilon_n \lambda_1}/Z(f) = c$. This is called
the discrete Maxwell-Boltzmann distribution. In physics, one writes $\lambda^{-1} =
kT$ with the Boltzmann constant $k$, determining $T$, the temperature.
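For finitely many energy levels, the multiplier of the discrete Maxwell-Boltzmann distribution can be found numerically. The sketch below is one possible approach, assuming a finite list of energies (a hypothetical example, not taken from the text) and using bisection, which works because the mean energy is a decreasing function of the multiplier.

```python
import math

def boltzmann(energies, lam):
    """Weights e^{-lam * eps_n} / Z for the given energy levels."""
    weights = [math.exp(-lam * e) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]

def mean_energy(energies, lam):
    return sum(e * f for e, f in zip(energies, boltzmann(energies, lam)))

def solve_multiplier(energies, c, lo=-50.0, hi=50.0, iterations=200):
    """Bisection for the multiplier lam with mean energy c (mean is decreasing in lam)."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mean_energy(energies, mid) > c:
            lo = mid  # mean too large: the multiplier has to increase
        else:
            hi = mid
    return 0.5 * (lo + hi)

energies = [0.0, 1.0, 2.0, 3.0]  # hypothetical energy levels
lam = solve_multiplier(energies, 1.0)
```

Since the uniform distribution on these four levels has mean 1.5, prescribing the smaller mean 1.0 leads to a positive multiplier, which corresponds to a positive temperature.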


Here is a dictionary matching some notions in probability theory with
corresponding terms in statistical physics. The statistical physics jargon is
often more intuitive.


Probability theory | Statistical physics
Densities of maximal entropy | Thermodynamic equilibria
Central limit theorem | Maximal entropy principle

Distributions which maximize the entropy, possibly under some constraint,
are mathematically natural because they are critical points of a variational
principle. Physically, they are natural, because nature prefers them. From
the statistical mechanical point of view, the extremal properties of entropy
offer insight into thermodynamics, where large systems are modeled with
statistical methods. Thermodynamic equilibria often extremize variational
problems in a given set of measures.


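Definition. Given a measure space $(\Omega, \mathcal{A})$ with a not necessarily finite
measure $\nu$ and a random variable $X \in \mathcal{L}$. Given $f \in L^1$ leading to the
probability measure $\mu = f \nu$. Consider the moment generating function
$Z(\lambda) = E_\mu[e^{\lambda X}]$ and define the interval $\Lambda = \{ \lambda \in \mathbb{R} \; | \; Z(\lambda) < \infty \}$ in $\mathbb{R}$.
For every $\lambda \in \Lambda$ we can define a new probability measure

$$ \mu_\lambda = f_\lambda \nu = \frac{e^{\lambda X}}{Z(\lambda)} \mu $$

on $\Omega$. The set

$$ \{ \mu_\lambda \; | \; \lambda \in \Lambda \} $$

of measures on $(\Omega, \mathcal{A})$ is called the exponential family defined by $\nu$ and $X$.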


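Theorem 2.15.3 (Minimizing relative entropy). For all probability measures
$\tilde{\mu}$ which are absolutely continuous with respect to $\nu$, we have for all $\lambda \in \Lambda$

$$ H(\tilde{\mu}|\mu) - \lambda E_{\tilde{\mu}}[X] \geq -\log Z(\lambda) . $$

The minimum $-\log Z(\lambda)$ is obtained for $\mu_\lambda$.


Proof. For every $\tilde{\mu} = \tilde{f} \nu$, we have

$$ H(\tilde{\mu}|\mu) = \int_\Omega \tilde{f} \log(\frac{\tilde{f}}{f_\lambda} \cdot \frac{f_\lambda}{f}) \, d\nu = H(\tilde{\mu}|\mu_\lambda) + (-\log(Z(\lambda)) + \lambda E_{\tilde{\mu}}[X]) . $$

For $\tilde{\mu} = \mu_\lambda$, we have

$$ H(\mu_\lambda|\mu) = -\log(Z(\lambda)) + \lambda E_{\mu_\lambda}[X] . $$

Therefore

$$ H(\tilde{\mu}|\mu) - \lambda E_{\tilde{\mu}}[X] = H(\tilde{\mu}|\mu_\lambda) - \log(Z(\lambda)) \geq -\log Z(\lambda) . $$

The minimum is obtained for $\tilde{\mu} = \mu_\lambda$. □

Corollary 2.15.4. (Minimizers for relative entropy)

a) $\mu_\lambda$ minimizes the relative entropy $\tilde{\mu} \mapsto H(\tilde{\mu}|\mu)$ among all $\nu$-absolutely
continuous measures $\tilde{\mu}$ with fixed $E_{\tilde{\mu}}[X]$.

b) If we fix $\lambda$ by requiring $E_{\mu_\lambda}[X] = c$, then $\mu_\lambda$ maximizes the entropy
$H(\tilde{\mu})$ among all measures $\tilde{\mu}$ satisfying $E_{\tilde{\mu}}[X] = c$.


Proof. a) Minimizing $\tilde{\mu} \mapsto H(\tilde{\mu}|\mu)$ under the constraint $E_{\tilde{\mu}}[X] = c$ is
equivalent to minimizing

$$ H(\tilde{\mu}|\mu) - \lambda E_{\tilde{\mu}}[X] , $$

and determining the Lagrange multiplier $\lambda$ by $E_{\mu_\lambda}[X] = c$. The above
theorem shows that $\mu_\lambda$ is minimizing that.

b) If $\mu = f \nu$ and $\mu_\lambda = e^{\lambda X} f \nu / Z$, then, taking the reference density $f = 1$,

$$ 0 \leq H(\tilde{\mu}|\mu_\lambda) = -H(\tilde{\mu}) + \log(Z) - \lambda E_{\tilde{\mu}}[X] = -H(\tilde{\mu}) + H(\mu_\lambda) , $$

where the last step uses $E_{\tilde{\mu}}[X] = c = E_{\mu_\lambda}[X]$. □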


Corollary 2.15.5. If $\nu = \mu$ is a probability measure, then $\mu_\lambda$ maximizes

$$ F(\tilde{\mu}) = H(\tilde{\mu}) + \lambda E_{\tilde{\mu}}[X] $$

among all measures $\tilde{\mu}$ which are absolutely continuous with respect to $\mu$.


Proof. Take $\mu = \nu$. Since then $f = 1$, $H(\tilde{\mu}|\mu) = -H(\tilde{\mu})$. The claim follows
from the theorem since a minimum of $H(\tilde{\mu}|\mu) - \lambda E_{\tilde{\mu}}[X]$ corresponds to a
maximum of $F(\tilde{\mu})$. □


This corollary can also be proved by calculus of variations, namely by
finding the maximum of $F(f) = -\int f \log(f) \, d\nu + \lambda \int X f \, d\nu$ under the constraint
$\int f \, d\nu = 1$.
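The inequality $H(\tilde{\mu}|\mu) - \lambda E_{\tilde{\mu}}[X] \geq -\log Z(\lambda)$ and the fact that the tilted measure $\mu_\lambda = e^{\lambda X} \mu / Z(\lambda)$ attains the bound can be checked on a finite set. The following is a sketch with a hypothetical reference density `f`, observable `X` and parameter `lam`, chosen only for illustration.

```python
import math

def relative_entropy(f_tilde, f):
    """H(mu~|mu) = sum f~ log(f~/f); assumes f > 0 wherever f~ > 0."""
    return sum(a * math.log(a / b) for a, b in zip(f_tilde, f) if a > 0.0)

def tilt(f, X, lam):
    """Return the density of mu_lambda = e^{lam X} mu / Z(lam) together with Z(lam)."""
    weights = [fv * math.exp(lam * x) for fv, x in zip(f, X)]
    z = sum(weights)
    return [w / z for w in weights], z

def functional(f_tilde, f, X, lam):
    """H(mu~|mu) - lam * E_mu~[X]."""
    mean = sum(a * x for a, x in zip(f_tilde, X))
    return relative_entropy(f_tilde, f) - lam * mean

f = [0.2, 0.5, 0.3]   # hypothetical reference density with respect to counting measure
X = [0.0, 1.0, 2.0]   # hypothetical observable
lam = 0.7
f_lam, z = tilt(f, X, lam)
bound = -math.log(z)  # the lower bound -log Z(lam)
```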


Remark. In statistical mechanics, the measure $\mu_\lambda$ is called the Gibbs
distribution or Gibbs canonical ensemble for the observable $X$ and $Z(\lambda)$ is called
the partition function. In physics, one uses the notation $\lambda = -(kT)^{-1}$
where $T$ is the temperature. Maximizing $H(\mu) - (kT)^{-1} E_\mu[X]$ is the same
as minimizing $E_\mu[X] - kT H(\mu)$ which is called the free energy if $X$ is
the Hamiltonian and $E_\mu[X]$ is the energy. The measure $\mu$ is the a priori
model, the micro canonical ensemble. Adding the restriction that $X$ has
a specific expectation value $c = E_\mu[X]$ leads to the probability measure
$\mu_\lambda$, the canonical ensemble. We illustrated two physical principles: nature
maximizes entropy when the energy is fixed and minimizes the free energy,
when energy is not fixed.

Example. Take on the real line the Hamiltonian $X(x) = x^2$ and a measure
$\mu = f \, dx$. We get the energy $\int x^2 \, d\mu$. Among all symmetric distributions
fixing the energy, the Gaussian distribution maximizes the entropy.


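Example. Let $\Omega = \mathbb{N} = \{0, 1, 2, \dots \}$ and $X(k) = k$ and let $\nu$ be the
counting measure on $\Omega$ and $\mu$ the Poisson measure with parameter 1. The
partition function is

$$ Z(\lambda) = \sum_k e^{\lambda k} \frac{e^{-1}}{k!} = \exp(e^\lambda - 1) $$

so that $\Lambda = \mathbb{R}$ and $\mu_\lambda$ is given by the weights

$$ \mu_\lambda(k) = \exp(-e^\lambda + 1) \, e^{\lambda k} \frac{e^{-1}}{k!} = e^{-\alpha} \frac{\alpha^k}{k!} , $$

where $\alpha = e^\lambda > 0$. The exponential family of the Poisson measure is the
family of all Poisson measures.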
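The computation in this example is easy to confirm numerically: the truncated series for the partition function should agree with $\exp(e^\lambda - 1)$, and the tilted weights with the Poisson weights for the parameter $e^\lambda$. The function names and the truncation at 200 terms in this sketch are assumptions.

```python
import math

def partition_function(lam, terms=200):
    """Z(lam) = sum_k e^{lam k} e^{-1} / k! for the Poisson measure with parameter 1."""
    total, w = 0.0, math.exp(-1.0)  # w holds e^{-1} e^{lam k} / k!
    for k in range(terms):
        total += w
        w *= math.exp(lam) / (k + 1)
    return total

def tilted_weight(lam, k):
    """mu_lambda(k) = e^{lam k} e^{-1} / (k! Z(lam))."""
    return math.exp(lam * k - 1.0) / (math.factorial(k) * partition_function(lam))

def poisson_weight(alpha, k):
    return math.exp(-alpha) * alpha ** k / math.factorial(k)

lam = 0.4
alpha = math.exp(lam)  # parameter of the tilted Poisson measure
```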


Example. The geometric distribution on $\mathbb{N} = \{0, 1, 2, 3, \dots \}$ is an
exponential family.


Example. The product measure on $\Omega = \{0, 1\}^N$ with win probability $p$ is
an exponential family with respect to $X(k) = k$.


Example. $\Omega = \{1, \dots, N\}$, $\nu$ the counting measure and let $\mu_p$ be the
binomial distribution with $p$. Take $\mu = \mu_{1/2}$ and $X(k) = k$. Since

$$ 0 \leq H(\tilde{\mu}|\mu) = H(\tilde{\mu}|\mu_p) + \log(p) E[X] + \log(1-p)(N - E[X]) $$
$$ = -H(\tilde{\mu}|\mu_p) + H(\mu_p) , $$

$\mu_p$ is an exponential family.


Remark. There is an obvious generalization of the maximum entropy
principle to the case, when we have finitely many random variables $\{X_i\}_{i=1}^n$.
Given $\mu = f \nu$ we define the ($n$-dimensional) exponential family

$$ \mu_\lambda = f_\lambda \nu = \frac{e^{\sum_{i=1}^n \lambda_i X_i}}{Z(\lambda)} \mu , $$

where

$$ Z(\lambda) = E_\mu[e^{\sum_{i=1}^n \lambda_i X_i}] $$

is the partition function defined on a subset $\Lambda$ of $\mathbb{R}^n$.

Theorem 2.15.6. For all probability measures $\tilde{\mu}$ which are absolutely
continuous with respect to $\nu$, we have for all $\lambda \in \Lambda$

$$ H(\tilde{\mu}|\mu) - \sum_i \lambda_i E_{\tilde{\mu}}[X_i] \geq -\log Z(\lambda) . $$

The minimum $-\log Z(\lambda)$ is obtained for $\mu_\lambda$. If we fix $\lambda_i$ by requiring
$E_{\mu_\lambda}[X_i] = c_i$, then $\mu_\lambda$ maximizes the entropy $H(\tilde{\mu})$ among all measures
$\tilde{\mu}$ satisfying $E_{\tilde{\mu}}[X_i] = c_i$.

Assume $\nu = \mu$ is a probability measure. The measure $\mu_\lambda$ maximizes

$$ F(\tilde{\mu}) = H(\tilde{\mu}) + \lambda \cdot E_{\tilde{\mu}}[X] . $$


Proof. Take the same proofs as before by replacing $\lambda X$ with $\lambda \cdot X = \sum_i \lambda_i X_i$. □


2.16 Markov operators


Definition. Given a not necessarily finite measure space $(\Omega, \mathcal{A}, \nu)$. A
linear operator $P : L^1(\Omega) \to L^1(\Omega)$ is called a Markov operator, if

$$ f \geq 0 \Rightarrow Pf \geq 0, $$
$$ f \geq 0 \Rightarrow ||Pf||_1 = ||f||_1 . $$


Remark. In other words, a Markov operator $P$ has to leave the closed
positive cone $L^1_+ = \{ f \in L^1 \; | \; f \geq 0 \}$ invariant and preserve the norm on
that cone.


Remark. A Markov operator on $(\Omega, \mathcal{A}, \nu)$ leaves invariant the set $D(\nu) =
\{ f \in L^1 \; | \; f \geq 0, ||f||_1 = 1 \}$ of probability densities. They correspond
bijectively to the set $P(\nu)$ of probability measures which are absolutely
continuous with respect to $\nu$. A Markov operator is therefore also called a
stochastic operator.


Example. Let $T$ be a measurable transformation on $(\Omega, \mathcal{A}, \nu)$. It is
called nonsingular if $T^* \nu$ is absolutely continuous with respect to $\nu$. The
unique operator $P : L^1 \to L^1$ satisfying

$$ \int_A Pf \, d\nu = \int_{T^{-1}A} f \, d\nu $$

is called the Perron-Frobenius operator associated to $T$. It is a Markov
operator. Closely related is the operator $Pf(x) = f(Tx)$ for measure
preserving invertible transformations. This Koopman operator is often studied
on $L^2$, but it becomes a Markov operator when considered as a
transformation on $L^1$.

Exercise. Assume $\Omega = [0,1]$ with Lebesgue measure $\mu$. Verify that the
Perron-Frobenius operator for the tent map

$$ T(x) = 2x \; {\rm for} \; x \in [0, 1/2], \quad T(x) = 2(1-x) \; {\rm for} \; x \in [1/2, 1] $$

is $Pf(x) = \frac{1}{2}(f(\frac{1}{2}x) + f(1 - \frac{1}{2}x))$.
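A numerical experiment does not replace the verification asked for in the exercise, but it can make the formula plausible. The sketch below applies $Pf(x) = \frac{1}{2}(f(x/2) + f(1 - x/2))$ to a density and checks that the integral is preserved; the midpoint integrator and the test densities are illustrative assumptions.

```python
def perron_frobenius_tent(f):
    """Return the function Pf for the tent map, Pf(x) = (f(x/2) + f(1 - x/2)) / 2."""
    return lambda x: 0.5 * (f(0.5 * x) + f(1.0 - 0.5 * x))

def integrate(f, a=0.0, b=1.0, steps=10000):
    """Midpoint rule on [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (j + 0.5) * h) for j in range(steps))

density = lambda x: 3.0 * x * x           # a probability density on [0, 1]
pushed = perron_frobenius_tent(density)   # again a probability density

# The density 2x is mapped to the constant density 1, the invariant density.
flat = perron_frobenius_tent(lambda x: 2.0 * x)
```

For $A = [0, a]$ the preimage under the tent map is $[0, a/2] \cup [1 - a/2, 1]$, which gives a way to test the defining relation of the Perron-Frobenius operator.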


Here is an abstract version of the Jensen inequality (2.5.1). It is due to M.
Kuczma. See [61].


Theorem 2.16.1 (Jensen inequality for positive operators). Given a convex
function $u$ and an operator $P : L^1 \to L^1$ mapping positive functions into
positive functions satisfying $P1 = 1$, then

$$ u(Pf) \leq P u(f) $$

for all $f \in L^1_+$ for which $P u(f)$ exists.


Proof. We have to show $u(Pf)(\omega) \leq P u(f)(\omega)$ for almost all $\omega \in \Omega$. Given
$x = (Pf)(\omega)$, there exists by definition of convexity a linear function $y \mapsto
ay + b$ such that $u(x) = ax + b$ and $u(y) \geq ay + b$ for all $y \in \mathbb{R}$. Therefore,
since $af + b \leq u(f)$ and $P$ is positive

$$ u(Pf)(\omega) = a(Pf)(\omega) + b = P(af + b)(\omega) \leq P(u(f))(\omega) . $$

□

The following theorem states that relative entropy does not increase along
orbits of Markov operators. The assumption that $\{ f > 0 \}$ is mapped into
itself is actually not necessary, but simplifies the proof.


Theorem 2.16.2 (Voigt, 1981). Given a Markov operator $P$ which maps
$\{ f > 0 \}$ into itself. For all $f, g \in L^1_+$,

$$ H(Pf|Pg) \leq H(f|g) . $$


Proof. We can assume that $\{ g(\omega) = 0 \} \subset A = \{ f(\omega) = 0 \}$ because nothing
is to show in the case $H(f|g) = \infty$. By restriction to the measure space
$(A^c, \mathcal{A} \cap A^c, \nu(\cdot \cap A^c))$, we can assume $f > 0, g > 0$ so that by our
assumption also $Pf > 0$ and $Pg > 0$.

(i) Assume first $(f/g)(\omega) \leq c$ for some constant $c \in \mathbb{R}$.

For fixed $g$, the linear operator $Rh = P(hg)/P(g)$ maps positive functions
into positive functions. Take the convex function $u(x) = x \log(x)$ and put
$h = f/g$. Using Jensen's inequality, we get

$$ \frac{Pf}{Pg} \log \frac{Pf}{Pg} = u(Rh) \leq R u(h) = \frac{P(f \log(f/g))}{Pg} $$

which is equivalent to $Pf \log \frac{Pf}{Pg} \leq P(f \log(f/g))$. Integration gives

$$ H(Pf|Pg) = \int Pf \log \frac{Pf}{Pg} \, d\nu \leq \int P(f \log(f/g)) \, d\nu = \int f \log(f/g) \, d\nu = H(f|g) . $$

(ii) Define $f_k = \inf(f, kg)$ so that $f_k/g \leq k$. We have $f_k \leq f_{k+1}$ and
$f_k \to f$ in $L^1$. From (i) we know that $H(Pf_k|Pg) \leq H(f_k|g)$. We can
assume $H(f|g) < \infty$ because the result is trivially true in the other case.
Define $B = \{ f \leq g \}$. On $B$, we have $f_k \log(f_k/g) = f \log(f/g)$ and on $\Omega \setminus B$
we have

$$ f_k \log(f_k/g) \leq f_{k+1} \log(f_{k+1}/g) \to f \log(f/g) $$

so that by Lebesgue dominated convergence theorem,

$$ H(f|g) = \lim_{k \to \infty} H(f_k|g) . $$

As an increasing sequence, $Pf_k$ converges to $Pf$ almost everywhere. The
elementary inequality $x \log(x) - x \geq x \log(y) - y$ for all $x \geq y \geq 0$ gives

$$ (Pf_k) \log(Pf_k) - (Pf_k) \log(Pg) - (Pf_k) + (Pg) \geq 0 . $$

Integration gives with Fatou's lemma (2.4.2)

$$ H(Pf|Pg) - ||Pf|| + ||Pg|| \leq \liminf_{k \to \infty} H(Pf_k|Pg) - ||Pf_k|| + ||Pg|| $$

and so $H(Pf|Pg) \leq \liminf_{k \to \infty} H(Pf_k|Pg)$. Together with (i), this gives
$H(Pf|Pg) \leq \lim_{k \to \infty} H(f_k|g) = H(f|g)$. □
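On a finite set with counting measure, a matrix with positive entries whose columns sum to 1 preserves positivity and the $L^1$ norm on the positive cone, so it is a Markov operator in the sense used here. The sketch below checks the inequality of the theorem for one such matrix; the matrix `P` and the densities `f`, `g` are hypothetical examples.

```python
import math

def apply(P, f):
    """Apply the matrix P (given as a list of rows) to the vector f."""
    return [sum(pij * fj for pij, fj in zip(row, f)) for row in P]

def relative_entropy(f, g):
    """H(f|g) = sum f log(f/g); assumes g > 0 wherever f > 0."""
    return sum(a * math.log(a / b) for a, b in zip(f, g) if a > 0.0)

# Hypothetical Markov operator: all entries positive, every column sums to 1.
P = [[0.5, 0.2, 0.1],
     [0.3, 0.6, 0.3],
     [0.2, 0.2, 0.6]]
f = [0.7, 0.2, 0.1]
g = [0.1, 0.3, 0.6]

before = relative_entropy(f, g)
after = relative_entropy(apply(P, f), apply(P, g))  # not larger than `before`
```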


Corollary 2.16.3. For an invertible Markov operator $P$, the relative entropy
is constant: $H(Pf|Pg) = H(f|g)$.


Proof. Because $P$ and $P^{-1}$ are both Markov operators,

$$ H(f|g) = H(PP^{-1}f|PP^{-1}g) \leq H(P^{-1}f|P^{-1}g) \leq H(f|g) . $$

□




Example. If a measure preserving transformation $T$ is invertible, then the
corresponding Koopman operator and Perron-Frobenius operators preserve
relative entropy.


Corollary 2.16.4. The operator $T(\mu)(A) = \int_{\mathbb{R}^2} 1_A(\frac{x+y}{\sqrt{2}}) \, d\mu(x) \, d\mu(y)$ does
not decrease entropy.


Proof. Denote by $X_\mu$ a random variable having the law $\mu$ and with $\mu(X)$
the law of a random variable $X$. For a fixed random variable $Y$, we define the
Markov operator

$$ P_Y(\mu) = \mu(\frac{X_\mu + Y}{\sqrt{2}}) . $$

Because the entropy is nondecreasing for each $P_Y$, we have this property
also for the nonlinear map $T(\mu) = P_{X_\mu}(\mu)$. □


We have shown as a corollary of the central limit theorem that $T$ has a
unique fixed point attracting all of $P_{0,1}$. The entropy is also strictly
increasing at infinitely many points of the orbit $T^n(\mu)$ since it converges to
the fixed point with maximal entropy. It follows that $T$ is not invertible.


More generally: given a sequence $X_n$ of IID random variables. For every $n$,
the map $P_n$ which maps the law of $S_n^*$ into the law of $S_{n+1}^*$ is a Markov
operator which does not decrease entropy. We can summarize: summing up
IID random variables tends to increase the entropy of the distributions.

A fixed point of a Markov operator is called a stationary state or in more
physical language a thermodynamic equilibrium. Important questions are:
is there a thermodynamic equilibrium for a given Markov operator $P$ and
if yes, how many are there?


2.17 Characteristic functions 


Distribution functions are in general not so easy to deal with, as for
example, when summing up independent random variables. It is therefore
convenient to deal with their Fourier transforms, the characteristic functions.
It is an important topic by itself [60].


Definition. Given a random variable $X$, its characteristic function is a
complex-valued function on $\mathbb{R}$ defined as

$$ \phi_X(u) = E[e^{iuX}] . $$

If $F_X$ is the distribution function of $X$ and $\mu_X$ its law, the characteristic
function of $X$ is the Fourier-Stieltjes transform

$$ \phi_X(t) = \int_{\mathbb{R}} e^{itx} \, dF_X(x) = \int_{\mathbb{R}} e^{itx} \, \mu_X(dx) . $$

Remark. If $F_X$ is a continuous distribution function $dF_X(x) = f_X(x) \, dx$,
then $\phi_X$ is the Fourier transform of the density function $f_X$:

$$ \int_{\mathbb{R}} e^{itx} f_X(x) \, dx . $$


Remark. By definition, characteristic functions are Fourier transforms of
probability measures: if $\mu$ is the law of $X$, then $\phi_X = \hat{\mu}$.


Example. For a random variable with density $f_X(x) = (m+1) x^m$ on
$\Omega = [0,1]$ the characteristic function is

$$ \phi_X(t) = \int_0^1 e^{itx} (m+1) x^m \, dx = \frac{(m+1) \, m! \, (1 - e^{it} e_m(-it))}{(-it)^{1+m}} , $$

where $e_n(x) = \sum_{k=0}^n x^k/(k!)$ is the $n$'th partial exponential function.


Theorem 2.17.1 (Lévy formula). The characteristic function $\phi_X$ determines
the distribution of $X$. If $a, b$ are points of continuity of $F_X$, then

$$ F_X(b) - F_X(a) = \frac{1}{2\pi} \int_{-\infty}^\infty \frac{e^{-ita} - e^{-itb}}{it} \phi_X(t) \, dt . \quad (2.9) $$

In general, one has

$$ \frac{1}{2\pi} \int_{-\infty}^\infty \frac{e^{-ita} - e^{-itb}}{it} \phi_X(t) \, dt = \mu[(a,b)] + \frac{1}{2} \mu[\{a\}] + \frac{1}{2} \mu[\{b\}] . $$


Proof. Because a distribution function $F$ has only countably many points of
discontinuities, it is enough to determine $F(b) - F(a)$ in terms of $\phi$ if $a$ and
$b$ are continuity points of $F$. The verification of the Lévy formula is then
a computation. For continuous distributions with density $F_X' = f_X$, it is the
inverse formula for the Fourier transform: $f_X(a) = \frac{1}{2\pi} \int_{-\infty}^\infty e^{-ita} \phi_X(t) \, dt$
so that $F_X(a) = \frac{1}{2\pi} \int_{-\infty}^\infty \frac{e^{-ita}}{-it} \phi_X(t) \, dt$. This proves the inversion formula
if $a$ and $b$ are points of continuity.

The general formula needs only to be verified when $\mu$ is a point measure
at the boundary of the interval. By linearity, one can assume $\mu$ is located
on a single point $b$ with $p = P[X = b] > 0$. The Fourier transform of the
Dirac measure $p \delta_b$ is $\phi_X(t) = p e^{itb}$. The claim reduces to

$$ \frac{1}{2\pi} \int_{-\infty}^\infty \frac{e^{-ita} - e^{-itb}}{it} p e^{itb} \, dt = \frac{p}{2} $$

which is equivalent to the claim $\lim_{R \to \infty} \int_{-R}^R \frac{e^{itc} - 1}{it} \, dt = \pi$ for $c > 0$.
Because the imaginary part is zero for every $R$ by symmetry, only

$$ \lim_{R \to \infty} \int_{-R}^R \frac{\sin(tc)}{t} \, dt = \pi $$

remains. The verification of this integral is a prototype computation in
residue calculus. □
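The inversion formula (2.9) can be tried out on the standard normal distribution, whose characteristic function is $e^{-t^2/2}$. The sketch below approximates the integral with a midpoint rule on a truncated interval and compares with $\Phi(b) - \Phi(a)$; the cutoff, the step count and the function names are assumptions.

```python
import cmath
import math

def levy_inversion(phi, a, b, cutoff=12.0, steps=24000):
    """Midpoint rule for (1/2pi) * integral of (e^{-ita} - e^{-itb}) / (it) * phi(t) dt."""
    h = 2.0 * cutoff / steps  # steps is even, so t = 0 is never a midpoint
    total = 0.0 + 0.0j
    for j in range(steps):
        t = -cutoff + (j + 0.5) * h
        total += (cmath.exp(-1j * t * a) - cmath.exp(-1j * t * b)) / (1j * t) * phi(t)
    return (total * h / (2.0 * math.pi)).real

def Phi(x):
    """Distribution function of N(0,1)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

phi_normal = lambda t: math.exp(-t * t / 2.0)  # characteristic function of N(0,1)
mass = levy_inversion(phi_normal, -1.0, 1.0)   # approximates P[-1 <= X <= 1]
```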


Theorem 2.17.2 (Characterization of weak convergence). A sequence $X_n$
of random variables converges weakly to $X$ if and only if its characteristic
functions converge point wise:

$$ \phi_{X_n}(t) \to \phi_X(t) . $$


Proof. Because the exponential function $e^{itx}$ is continuous for each $t$, it
follows from the definition that weak convergence implies the point wise
convergence of the characteristic functions. From formula (2.9) follows that
if the characteristic functions converge point wise, then convergence in
distribution takes place. We have learned in lemma (2.13.2) that weak
convergence is equivalent to convergence in distribution. □

Example. Here is a table of characteristic functions (CF) $\phi_X(t) = E[e^{itX}]$
and moment generating functions (MGF) $M_X(t) = E[e^{tX}]$ for some familiar
random variables:


Distribution | Parameter | CF | MGF
Normal | $m \in \mathbb{R}, \sigma^2 > 0$ | $e^{mit - \sigma^2 t^2/2}$ | $e^{mt + \sigma^2 t^2/2}$
Uniform | on $[-a, a]$ | $\sin(at)/(at)$ | $\sinh(at)/(at)$
Exponential | $\lambda > 0$ | $\lambda/(\lambda - it)$ | $\lambda/(\lambda - t)$
Binomial | $n \geq 1, p \in [0,1]$ | $(1 - p + p e^{it})^n$ | $(1 - p + p e^t)^n$

Definition. Let $F$ and $G$ be two probability distribution functions. Their
convolution $F \star G$ is defined as

$$ F \star G(x) = \int_{\mathbb{R}} F(x - y) \, dG(y) . $$


Lemma 2.17.3. If F and G are distribution functions, then F * G is again 
a distribution function. 




Proof. We have to verify the three properties which characterize distribu- 
tion functions among real-valued functions as in proposition (2.12.1). 

a) Since F is nondecreasing, also F x G is nondecreasing. 

b) Because F(—0o) = 0 we have also F x G(—o0o) = 0. Since F'(oo) = 1 and 
dG is a probability measure, also F' * G(oo) = 1. 

c) Given a sequence h, — 0. Define F,(x) = F(a + hn). Because F is con- 
tinuous from the right, F,,(2) converges point wise to F(x). The Lebesgue 
dominated convergence theorem implies that F, * G(r) = F * G(x + hn) 
converges to F x G(x). O 


Example. Given two discrete distributions

$$ F(x) = \sum_{n \leq x} p_n , \qquad G(x) = \sum_{n \leq x} q_n . $$

Then $F \star G(x) = \sum_{n \leq x} (p \star q)_n$, where $p \star q$ is the convolution of the
sequences $p, q$ defined by $(p \star q)_n = \sum_{k=0}^n p_k q_{n-k}$. We see that the
convolution of discrete distributions gives again a discrete distribution.
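
The convolution of two sequences can be tried out on a small case. The following is a minimal sketch, assuming sequences indexed from 0 and stored as finite lists; the function name `convolve` and the fair-die example are illustrative choices, not part of the text.

```python
from fractions import Fraction

def convolve(p, q):
    """Return the sequence (p*q)_n = sum_{k=0}^{n} p_k q_{n-k} for finite lists p, q."""
    r = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, p_i in enumerate(p):
        for j, q_j in enumerate(q):
            r[i + j] += p_i * q_j
    return r

# Illustrative assumption: p_n = q_n = 1/6 for n = 1..6 (a fair die), p_0 = 0.
die = [Fraction(0)] + [Fraction(1, 6)] * 6
two_dice = convolve(die, die)      # two_dice[n] = P[sum of two dice = n]

# F*G(x) = sum_{n <= x} (p*q)_n, evaluated at x = 7
cdf_at_7 = sum(two_dice[:8])
```

Exact rational arithmetic is used so that the total mass of the convolved sequence can be checked to be exactly 1.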


Example. Given two continuous distributions $F, G$ with densities $h$ and $k$.
Then the density of $F \star G$ is given by the convolution

$$ h \star k(x) = \int h(x - y) k(y) \, dy $$

because

$$ (F \star G)'(x) = \frac{d}{dx} \int F(x - y) k(y) \, dy = \int h(x - y) k(y) \, dy . $$


Lemma 2.17.4. If $F$ and $G$ are distribution functions with characteristic
functions $\phi$ and $\psi$, then $F \star G$ has the characteristic function $\phi \cdot \psi$.


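Proof. While one can deduce this fact directly from Fourier theory, we
prove it by hand: use an approximation of the integral by step functions:

$$ \begin{aligned}
\int_{\mathbb{R}} e^{iux} \, d(F \star G)(x)
 &= \lim_{N,n \to \infty} \sum_{k=-N2^n+1}^{N2^n} e^{iuk2^{-n}} \int_{\mathbb{R}} [ F(\tfrac{k}{2^n} - y) - F(\tfrac{k-1}{2^n} - y) ] \, dG(y) \\
 &= \lim_{N,n \to \infty} \sum_{k=-N2^n+1}^{N2^n} \int_{\mathbb{R}} e^{iu(\frac{k}{2^n} - y)} [ F(\tfrac{k}{2^n} - y) - F(\tfrac{k-1}{2^n} - y) ] \, e^{iuy} \, dG(y) \\
 &= \int_{\mathbb{R}} [ \int_{\mathbb{R}} e^{iux} \, dF(x) ] \, e^{iuy} \, dG(y) = \int_{\mathbb{R}} \phi(u) e^{iuy} \, dG(y) \\
 &= \phi(u) \psi(u) .
\end{aligned} $$
□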
It follows that the set of distribution functions forms an associative,
commutative semigroup with respect to the convolution multiplication. The
reason is that the characteristic functions have this property with pointwise
multiplication.


Characteristic functions become especially useful if one deals with
independent random variables. Their characteristic functions multiply:


Proposition 2.17.5. Given a finite set of independent random variables
$X_j$, $j = 1, \dots, n$ with characteristic functions $\phi_j$. The characteristic
function of $\sum_{j=1}^n X_j$ is $\phi = \prod_{j=1}^n \phi_j$.


Proof. Since the $X_j$ are independent, we get for any set of complex-valued
measurable functions $g_j$, for which $E[g_j(X_j)]$ exists:

$$ E[ \prod_{j=1}^n g_j(X_j) ] = \prod_{j=1}^n E[ g_j(X_j) ] . $$

This follows almost immediately from the definition of independence, since
one can check it first for functions with $g_j(X_j) = 1_{A_j}$, where the $A_j$ are
$\sigma(X_j)$-measurable sets, for which $g_j(X_j) g_k(X_k) = 1_{A_j \cap A_k}$ and

$$ E[ g_j(X_j) g_k(X_k) ] = P[A_j] P[A_k] = E[ g_j(X_j) ] E[ g_k(X_k) ] , $$

then for step functions by linearity and then for arbitrary measurable
functions.

If we put $g_j(x) = \exp(itx)$, the proposition is proved. □
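
The multiplication rule for characteristic functions can be checked exactly on finite distributions. The following is a small sketch under stated assumptions: laws are stored as dictionaries `{value: probability}`, and the helper names `char_fn` and `law_of_sum`, as well as the coin and die laws, are illustrative and not taken from the text.

```python
import cmath

def char_fn(pmf, t):
    """phi(t) = E[exp(i t X)] for a finite law given as {value: probability}."""
    return sum(p * cmath.exp(1j * t * x) for x, p in pmf.items())

def law_of_sum(pmf_x, pmf_y):
    """Law of X + Y for independent X and Y (discrete convolution)."""
    out = {}
    for x, p in pmf_x.items():
        for y, q in pmf_y.items():
            out[x + y] = out.get(x + y, 0.0) + p * q
    return out

coin = {0: 0.5, 1: 0.5}                       # illustrative law of X
die = {k: 1.0 / 6.0 for k in range(1, 7)}     # illustrative law of Y
t = 0.7
lhs = char_fn(law_of_sum(coin, die), t)       # characteristic function of X + Y
rhs = char_fn(coin, t) * char_fn(die, t)      # product of characteristic functions
gap = abs(lhs - rhs)                          # zero up to rounding error
```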


Example. If $X_n$ are IID random variables which take the values 0 and 2 with
probability 1/2 each, the random variable $X = \sum_{n=1}^\infty X_n/3^n$ is a random
variable with the Cantor distribution. Because the characteristic function
of $X_n/3^n$ is $\phi_{X_n/3^n}(t) = E[e^{itX_n/3^n}] = \frac{e^{i2t/3^n} + 1}{2}$, we see that the
characteristic function of $X$ is

$$ \phi_X(t) = \prod_{n=1}^\infty \frac{e^{i2t/3^n} + 1}{2} . $$

The centered random variable $Y = X - 1/2$ can be written as
$Y = \sum_{n=1}^\infty Y_n/3^n$, where $Y_n$ takes values $-1, 1$ with probability 1/2. So

$$ \phi_Y(t) = \prod_{n} E[e^{itY_n/3^n}] = \prod_{n=1}^\infty \frac{e^{it/3^n} + e^{-it/3^n}}{2} = \prod_{n=1}^\infty \cos(\frac{t}{3^n}) . $$

This formula for the Fourier transform of a singular continuous measure $\mu$
has already been derived by Wiener. The Fourier theory of fractal measures
has been developed much more since then.
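
The product formula can be evaluated numerically by truncating it. The sketch below assumes a truncation after 40 factors and a finite-difference step `h`; both are illustrative choices. It approximates $E[Y^2] = -\phi_Y''(0)$ and can be compared with $\sum_{n \geq 1} 9^{-n} = 1/8$, the variance obtained by adding the variances of the independent terms $Y_n/3^n$.

```python
import math

def phi_Y(t, terms=40):
    """Truncated product prod_{n=1}^{terms} cos(t / 3**n)."""
    value = 1.0
    for n in range(1, terms + 1):
        value *= math.cos(t / 3.0 ** n)
    return value

# E[Y^2] = -phi_Y''(0); central difference with phi_Y(0) = 1 and phi_Y(-h) = phi_Y(h)
h = 1e-3
second_moment = (2.0 - 2.0 * phi_Y(h)) / h ** 2
```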


Figure. The characteristic function $\phi_Y(t)$ of a random variable $Y$ with a
centered Cantor distribution supported on $[-1/2, 1/2]$ has the explicit
formula $\phi_Y(t) = \prod_{n=1}^\infty \cos(t/3^n)$, which was already derived by Wiener in
the early 20th century. The formula can also be used to compute moments
with the moment formula $E[X^m] = (-i)^m \frac{d^m}{dt^m} \phi_X(t)|_{t=0}$.


Corollary 2.17.6. The probability density of the sum of independent random
variables $\sum_{j=1}^n X_j$ is $f_1 \star f_2 \star \cdots \star f_n$, if $X_j$ has the density $f_j$.


Proof. This follows immediately from proposition (2.17.5) and the algebraic
isomorphism between the algebra of distribution functions with convolution
product and the algebra of characteristic functions with pointwise
multiplication. □


Example. Let $Y_k$ be IID random variables and let $X_k = \lambda^k Y_k$ with $0 < \lambda < 1$.
The process $S_n = \sum_{k=1}^n X_k$ is called the random walk with variable step
size or the branching random walk with exponentially decreasing steps. Let
$\mu$ be the law of the random sum $X = \sum_{k=1}^\infty X_k$. If $\phi_Y(t)$ is the characteristic
function of $Y$, then the characteristic function of $X$ is

$$ \phi_X(t) = \prod_{n=1}^\infty \phi_Y(t \lambda^n) . $$

For example, if the random variables $Y_n$ take values $-1, 1$ with probability
1/2, where $\phi_Y(t) = \cos(t)$, then

$$ \phi_X(t) = \prod_{n=1}^\infty \cos(t \lambda^n) . $$

The measure $\mu$ is then called a Bernoulli convolution. For example, for
$\lambda = 1/3$, the measure is supported on the Cantor set as we have seen
above. For more information on this stochastic process and the properties
of the measure $\mu$, which depends in a subtle way on $\lambda$, see [41].
Exercise. Show that $X_n \to X$ in distribution if and only if the characteristic
functions satisfy $\phi_{X_n}(t) \to \phi_X(t)$ for all $t \in \mathbb{R}$.


Exercise. The characteristic function of a vector-valued random variable
$X = (X_1, \dots, X_k)$ is the complex-valued function

$$ \phi_X(t) = E[e^{it \cdot X}] $$

on $\mathbb{R}^k$, where we wrote $t = (t_1, \dots, t_k)$. Two such random variables $X, Y$
are independent, if the $\sigma$-algebras $X^{-1}(\mathcal{B})$ and $Y^{-1}(\mathcal{B})$ are independent,
where $\mathcal{B}$ is the Borel $\sigma$-algebra on $\mathbb{R}^k$.
a) Show that if $X$ and $Y$ are independent then $\phi_{X+Y} = \phi_X \cdot \phi_Y$.
b) Given a real nonsingular $k \times k$ matrix $A$ called the covariance matrix
and a vector $m = (m_1, \dots, m_k)$ called the mean of $X$. We say that a
vector-valued random variable $X$ has a Gaussian distribution with covariance $A$
and mean $m$, if

$$ \phi_X(t) = e^{i m \cdot t - \frac{1}{2}(t \cdot A t)} . $$

Show that the sum $X + Y$ of two independent Gaussian distributed random
variables is again Gaussian distributed.

c) Find the probability density of a Gaussian distributed random variable
$X$ with covariance matrix $A$ and mean $m$.


Exercise. The Laplace transform of a positive random variable $X \geq 0$ is
defined as $l(t) = E[e^{-tX}]$. The moment generating function is defined as
$M(t) = E[e^{tX}]$ provided that the expectation exists in a neighborhood of
0. The generating function of an integer-valued random variable is defined
as $\zeta(X) = E[u^X]$ for $u \in (0,1)$. What does independence of two random
variables $X, Y$ mean in terms of (i) the Laplace transform, (ii) the moment
generating function or (iii) the generating function?


Exercise. Let $(\Omega, \mathcal{A}, \mu)$ be a probability space and let $U, V$ be random
variables (describing the energy density and the mass density of a
thermodynamical system). We have seen that the Helmholtz free energy

$$ E_\mu[U] - kT H[\mu] $$

($k$ is a physical constant, $T$ is the temperature) takes its minimum for
the exponential family. Find the measure minimizing the free enthalpy or
Gibbs potential

$$ E_\mu[U] - kT H[\mu] - p E_\mu[V] , $$

where $p$ is the pressure.
Exercise. Let $(\Omega, \mathcal{A}, \mu)$ be a probability space and $X_i \in \mathcal{L}$ random variables.
Compute $E_\mu[X_i]$ and the entropy of $\mu_\lambda$ in terms of the partition function
$Z(\lambda)$.


Exercise. a) Given the discrete measure space $(\Omega = \{\epsilon_0 + n\delta\}, \nu)$, with
$\epsilon_0 \in \mathbb{R}$ and $\delta > 0$, where $\nu$ is the counting measure, and let $X(k) = k$.
Find the distribution $f$ maximizing the entropy $H(f)$ among all measures
$\mu = f\nu$ fixing $E_\mu[X] = \epsilon$.

b) The physical interpretation is as follows: $\Omega$ is the discrete set of
energies of a harmonic oscillator, $\epsilon_0$ is the ground state energy, $\delta = \hbar\omega$ is the
incremental energy, where $\omega$ is the frequency of the oscillation and $\hbar$ is
Planck's constant. $X(k) = k$ is the Hamiltonian and $E[X]$ is the energy.
Put $\lambda = 1/kT$, where $T$ is the temperature (in the answer of a), there
appears a parameter $\lambda$, the Lagrange multiplier of the variational problem).
Since one can also fix the temperature $T$ instead of the energy $\epsilon$, the
distribution in a) maximizing the entropy is determined by $\omega$ and $T$. Compute
the spectrum $\epsilon(\omega, T)$ of the blackbody radiation defined by

$$ \epsilon(\omega, T) = (E[X] - \epsilon_0) \frac{\omega^2}{\pi^2 c^3} , $$

where $c$ is the velocity of light. You have deduced then Planck's blackbody
radiation formula.


2.18 The law of the iterated logarithm 


We will only give a proof of the law of the iterated logarithm in the special
case when the random variables $X_n$ are independent and all have the
standard normal distribution. The proof of the theorem for general IID
random variables $X_n$ can be found for example in [105]. The central limit
theorem makes the general result plausible from the special case.


Definition. A random variable $X \in \mathcal{L}$ is called symmetric if its law $\mu_X$
satisfies

$$ \mu([-b, -a)) = \mu([a, b)) $$

for all $a < b$. A symmetric random variable $X \in \mathcal{L}$ has zero mean. We
again use the notation $S_n = \sum_{k=1}^n X_k$ in this section.


Lemma 2.18.1. Let $X_n$ be symmetric and independent. For every $\epsilon > 0$

$$ P[ \max_{1 \leq k \leq n} S_k > \epsilon ] \leq 2 P[ S_n > \epsilon ] . $$

Proof. This is a direct consequence of Lévy's theorem (2.11.6) because we
can take $m = 0$ as the median of a symmetric distribution. □


Definition. Define for $n > 2$ the constants $\Lambda_n = \sqrt{2n \log\log n}$. It grows only
slightly faster than $\sqrt{2n}$. For example, in order that the factor $\sqrt{\log\log n}$
is 3, we already have $n = \exp(\exp(9)) > 1.33 \cdot 10^{3519}$.
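
The size of $\exp(\exp(9))$ can be confirmed without overflowing floating point by working with its base-10 logarithm. The following is a small sketch; the function name `Lambda` and the sample value $n = 10^6$ are illustrative assumptions.

```python
import math

# log10(exp(exp(9))) = exp(9) / ln(10); the number itself is too large for a float.
log10_n = math.exp(9.0) / math.log(10.0)
exponent = math.floor(log10_n)
mantissa = 10.0 ** (log10_n - exponent)    # n = mantissa * 10**exponent

def Lambda(n):
    """Lambda_n = sqrt(2 n log log n), real-valued for n > e."""
    return math.sqrt(2.0 * n * math.log(math.log(n)))

# The extra factor sqrt(log log n) is still small for n = 10**6.
ratio = Lambda(10 ** 6) / math.sqrt(2.0 * 10 ** 6)
```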


Theorem 2.18.2 (Law of iterated logarithm for N(0,1)). Let $X_n$ be a
sequence of IID N(0,1)-distributed random variables. Then

$$ \limsup_{n \to \infty} \frac{S_n}{\Lambda_n} = 1 , \qquad \liminf_{n \to \infty} \frac{S_n}{\Lambda_n} = -1 . $$


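Proof. We follow [47]. Because the second statement follows from the first
one by replacing $X_n$ by $-X_n$, we only have to prove

$$ \limsup_{n \to \infty} S_n/\Lambda_n = 1 . $$

(i) $P[S_n > (1+\epsilon)\Lambda_n, \text{ infinitely often}] = 0$ for all $\epsilon > 0$.

Define $n_k = [(1+\epsilon)^k] \in \mathbb{N}$, where $[x]$ is the integer part of $x$, and the events

$$ A_k = \{ S_n > (1+\epsilon)\Lambda_n \text{ for some } n \in (n_k, n_{k+1}] \} . $$

Clearly $\limsup_k A_k = \{ S_n > (1+\epsilon)\Lambda_n, \text{ infinitely often} \}$. By the first
Borel-Cantelli lemma (2.2.2), it is enough to show that $\sum_k P[A_k] < \infty$. For each
$k$, we get with the above lemma, using that $\Lambda_n \geq \Lambda_{n_k}$ for $n > n_k$,

$$ P[A_k] \leq P[ \max_{n_k < n \leq n_{k+1}} S_n > (1+\epsilon)\Lambda_{n_k} ] \leq 2 P[ S_{n_{k+1}} > (1+\epsilon)\Lambda_{n_k} ] . $$

The right-hand side can be estimated further using that $S_{n_{k+1}}/\sqrt{n_{k+1}}$
is N(0,1)-distributed and that for a N(0,1)-distributed random variable
$P[X > t] \leq \text{const} \cdot e^{-t^2/2}$:

$$ \begin{aligned}
2 P[ S_{n_{k+1}} > (1+\epsilon)\Lambda_{n_k} ]
 &= 2 P[ \frac{S_{n_{k+1}}}{\sqrt{n_{k+1}}} > (1+\epsilon) \frac{\sqrt{2 n_k \log\log n_k}}{\sqrt{n_{k+1}}} ] \\
 &\leq C \exp( -\frac{1}{2} (1+\epsilon)^2 \frac{2 n_k \log\log n_k}{n_{k+1}} ) \\
 &\leq C_1 \exp( -(1+\epsilon) \log\log(n_k) ) \\
 &= C_1 \log(n_k)^{-(1+\epsilon)} \leq C_2 k^{-(1+\epsilon)} .
\end{aligned} $$

Having shown that $P[A_k] \leq \text{const} \cdot k^{-(1+\epsilon)}$ proves the claim $\sum_k P[A_k] < \infty$.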


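(ii) $P[S_n > (1-\epsilon)\Lambda_n, \text{ infinitely often}] = 1$ for all $\epsilon > 0$.

It suffices to show that for all $\epsilon > 0$, there exists a subsequence $n_k$ with

$$ P[S_{n_k} > (1-\epsilon)\Lambda_{n_k}, \text{ infinitely often}] = 1 . $$

Given $\epsilon > 0$. Choose $N > 1$ large enough and $c < 1$ near enough to 1 such
that

$$ c \sqrt{1 - 1/N} - 2/\sqrt{N} > 1 - \epsilon . \qquad (2.10) $$

Define $n_k = N^k$ and $\Delta n_k = n_k - n_{k-1}$. The sets

$$ A_k = \{ S_{n_k} - S_{n_{k-1}} > c \sqrt{2 \Delta n_k \log\log \Delta n_k} \} $$

are independent. In the following estimate, we use the fact that
$\int_t^\infty e^{-x^2/2} \, dx \geq C \cdot e^{-t^2/2}$ for some constant $C$.

$$ \begin{aligned}
P[A_k] &= P[ \{ S_{n_k} - S_{n_{k-1}} > c \sqrt{2 \Delta n_k \log\log \Delta n_k} \} ] \\
 &= P[ \{ \frac{S_{n_k} - S_{n_{k-1}}}{\sqrt{\Delta n_k}} > c \frac{\sqrt{2 \Delta n_k \log\log \Delta n_k}}{\sqrt{\Delta n_k}} \} ] \\
 &\geq C \cdot \exp( -c^2 \log\log \Delta n_k ) = C \cdot \exp( -c^2 \log(k \log N) ) \\
 &\geq C_1 \cdot \exp( -c^2 \log k ) = C_1 k^{-c^2}
\end{aligned} $$

so that $\sum_k P[A_k] = \infty$. We have therefore by Borel-Cantelli a set $A$ of full
measure so that for $\omega \in A$

$$ S_{n_k} - S_{n_{k-1}} > c \sqrt{2 \Delta n_k \log\log \Delta n_k} $$

for infinitely many $k$. From (i), we know that

$$ S_{n_k} > -2 \sqrt{2 n_k \log\log n_k} $$

for sufficiently large $k$. Both inequalities hold therefore for infinitely many
values of $k$. For such $k$,

$$ \begin{aligned}
S_{n_k}(\omega) &> S_{n_{k-1}}(\omega) + c \sqrt{2 \Delta n_k \log\log \Delta n_k} \\
 &\geq -2 \sqrt{2 n_{k-1} \log\log n_{k-1}} + c \sqrt{2 \Delta n_k \log\log \Delta n_k} \\
 &\geq ( -2/\sqrt{N} + c \sqrt{1 - 1/N} ) \sqrt{2 n_k \log\log n_k} \\
 &\geq (1 - \epsilon) \sqrt{2 n_k \log\log n_k} ,
\end{aligned} $$

where we have used assumption (2.10) in the last inequality. □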


We know that N(0,1) is the unique fixed point of the map $T$ by the central
limit theorem. If the law of the iterated logarithm is true for $T(X)$, then
it is true for $X$. This shows that it would be enough to prove the theorem
in the case when $X$ has a distribution in an arbitrarily small neighborhood of
N(0,1). We would however need sharper estimates.


We present a second proof of the central limit theorem in the IID case, to 
illustrate the use of characteristic functions. 




Theorem 2.18.3 (Central limit theorem for IID random variables). Given
$X_n \in \mathcal{L}^2$ which are IID with mean 0 and finite variance $\sigma^2$. Then
$S_n/(\sigma\sqrt{n}) \to N(0,1)$ in distribution.


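Proof. The characteristic function of N(0,1) is $\phi(t) = e^{-t^2/2}$. We have to
show that for all $t \in \mathbb{R}$

$$ E[ e^{it \frac{S_n}{\sigma\sqrt{n}}} ] \to e^{-t^2/2} . $$

Denote by $\phi_{X_n}$ the characteristic function of $X_n$. Since by assumption
$E[X_n] = 0$ and $E[X_n^2] = \sigma^2$, we have

$$ \phi_{X_n}(t) = 1 - \frac{\sigma^2}{2} t^2 + o(t^2) . $$

Therefore

$$ \begin{aligned}
E[ e^{it \frac{S_n}{\sigma\sqrt{n}}} ] &= \phi_{X_n}( \frac{t}{\sigma\sqrt{n}} )^n \\
 &= ( 1 - \frac{1}{2} \frac{t^2}{n} + o(\frac{1}{n}) )^n \\
 &= e^{-t^2/2} + o(1) .
\end{aligned} $$
□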
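
The convergence in this proof can be observed on a case where the characteristic function is known in closed form. The sketch below assumes IID signs $X_k = \pm 1$ with probability 1/2 each (so $\sigma = 1$ and $\phi_{X_k}(t) = \cos(t)$); this choice and the function name are illustrative.

```python
import math

def phi_normalized_sum(t, n):
    """E[exp(i t S_n / sqrt(n))] = cos(t / sqrt(n))**n for IID fair signs."""
    return math.cos(t / math.sqrt(n)) ** n

t = 1.0
target = math.exp(-t * t / 2.0)          # characteristic function of N(0,1) at t
errors = [abs(phi_normalized_sum(t, n) - target) for n in (10, 1000, 100000)]
```

In this example the error decreases roughly like $1/n$, which reflects the $o(1/n)$ remainder in the expansion above.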


This method can be adapted to other situations as the following example 
shows. 


Proposition 2.18.4. Given a sequence of independent events $A_n \subset \Omega$ with
$P[A_n] = 1/n$. Define the random variables $X_n = 1_{A_n}$ and $S_n = \sum_{k=1}^n X_k$.
Then

$$ T_n = \frac{S_n - \log(n)}{\sqrt{\log(n)}} $$

converges to N(0,1) in distribution.


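Proof. The expectation

$$ E[S_n] = \sum_{k=1}^n \frac{1}{k} = \log(n) + \gamma + o(1) , $$

where $\gamma = \lim_{n \to \infty} \sum_{k=1}^n \frac{1}{k} - \log(n)$ is the Euler constant, and the variance

$$ Var[S_n] = \sum_{k=1}^n \frac{1}{k} (1 - \frac{1}{k}) = \log(n) + \gamma - \frac{\pi^2}{6} + o(1) $$

satisfy $E[T_n] \to 0$ and $Var[T_n] \to 1$. Compute $\phi_{X_k}(t) = 1 - \frac{1}{k} + \frac{e^{it}}{k}$ so that
$\phi_{S_n}(t) = \prod_{k=1}^n (1 - \frac{1}{k} + \frac{e^{it}}{k})$ and $\phi_{T_n}(t) = \phi_{S_n}(s(t)) e^{-is\log(n)}$, where
$s = t/\sqrt{\log(n)}$. For $n \to \infty$, we compute

$$ \begin{aligned}
\log \phi_{T_n}(t) &= -it\sqrt{\log(n)} + \sum_{k=1}^n \log( 1 + \frac{1}{k}(e^{is} - 1) ) \\
 &= -it\sqrt{\log(n)} + \sum_{k=1}^n \log( 1 + \frac{1}{k}( is - \frac{1}{2}s^2 + o(s^2) ) ) \\
 &= -it\sqrt{\log(n)} + \sum_{k=1}^n \frac{1}{k}( is - \frac{1}{2}s^2 + o(s^2) ) + O( \sum_{k=1}^n \frac{s^2}{k^2} ) \\
 &= -it\sqrt{\log(n)} + ( is - \frac{1}{2}s^2 + o(s^2) )( \log(n) + O(1) ) + t^2 o(1) \\
 &= -\frac{1}{2}t^2 + o(1) \to -\frac{1}{2}t^2 .
\end{aligned} $$

We see that $T_n$ converges in law to the standard normal distribution. □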
Chapter 3


Discrete Stochastic Processes 


3.1 Conditional Expectation 


Definition. Given a probability space $(\Omega, \mathcal{A}, P)$. A second measure $P'$ on
$(\Omega, \mathcal{A})$ is called absolutely continuous with respect to $P$, if $P[A] = 0$ implies
$P'[A] = 0$ for all $A \in \mathcal{A}$. One writes $P' \ll P$.


Example. If $P[a, b] = b - a$ is the uniform distribution on $\Omega = [0,1]$ and $\mathcal{A}$
is the Borel $\sigma$-algebra, and $Y \in \mathcal{L}^1$ satisfies $Y(x) \geq 0$ for all $x \in \Omega$, then
$P'[a, b] = \int_a^b Y(x) \, dx$ is absolutely continuous with respect to $P$.


Example. Assume $P$ is again the Lebesgue measure on $[0,1]$ as in the last
example. If $Y(x) = 1_B(x)$, then $P'[A] = P[A \cap B]$ for all $A \in \mathcal{A}$. If $P[B] < 1$,
then $P$ is not absolutely continuous with respect to $P'$. We have $P'[B^c] = 0$
but $P[B^c] = 1 - P[B] > 0$.


Example. If $P'[A] = 1$ for $1/2 \in A$ and $P'[A] = 0$ for $1/2 \notin A$, then $P'$ is
not absolutely continuous with respect to $P$. For $B = \{1/2\}$, we have
$P[B] = 0$ but $P'[B] = 1 \neq 0$.


The next theorem is a reformulation of a classical theorem of Radon and
Nikodym of 1913 and 1930.


Theorem 3.1.1 (Radon-Nikodym equivalent). Given a measure $P'$ which
is absolutely continuous with respect to $P$, then there exists a unique
$Y \in \mathcal{L}^1(P)$ with $P' = YP$. The function $Y$ is called the Radon-Nikodym
derivative of $P'$ with respect to $P$. It is unique in $L^1$.


Proof. We can assume without loss of generality that $P'$ is a positive
measure (otherwise use the Hahn decomposition $P' = P^+ - P^-$, where $P^+$ and
$P^-$ are positive measures).

(i) Construction: We recall the notation $E[Y; A] = E[1_A Y] = \int_A Y \, dP$.

The set $\Gamma = \{ Y \geq 0 \;|\; E[Y; A] \leq P'[A], \forall A \in \mathcal{A} \}$ is closed under formation
of suprema

$$ \begin{aligned}
E[Y_1 \vee Y_2; A] &= E[Y_1; A \cap \{Y_1 > Y_2\}] + E[Y_2; A \cap \{Y_2 \geq Y_1\}] \\
 &\leq P'[A \cap \{Y_1 > Y_2\}] + P'[A \cap \{Y_2 \geq Y_1\}] = P'[A]
\end{aligned} $$

and contains a function $Y$ different from 0 since else, $P'$ would be singular
with respect to $P$ according to the definition (2.15) of absolute continuity.
We claim that the supremum $Y$ of all functions in $\Gamma$ satisfies $YP = P'$: an
application of Beppo Levi's theorem (2.4.1) shows that the supremum of $\Gamma$
is in $\Gamma$. The measure $P'' = P' - YP$ is the zero measure since we could do
the same argument with a new set $\Gamma$ for the absolutely continuous part of
$P''$.

(ii) Uniqueness: assume there exist two derivatives $Y, Y'$. One has then
$E[Y - Y'; \{Y \geq Y'\}] = 0$ and so $Y \leq Y'$ almost everywhere. A similar
argument gives $Y' \leq Y$ almost everywhere, so that $Y = Y'$ almost everywhere.
In other words, $Y = Y'$ in $L^1$. □


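Theorem 3.1.2 (Existence of conditional expectation, Kolmogorov 1933).
Given $X \in \mathcal{L}^1(\mathcal{A})$ and a sub $\sigma$-algebra $\mathcal{B} \subset \mathcal{A}$. There exists a random
variable $Y \in \mathcal{L}^1(\mathcal{B})$ with $\int_A Y \, dP = \int_A X \, dP$ for all $A \in \mathcal{B}$.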


Proof. Define the measures $\tilde{P}[A] = P[A]$ and $P'[A] = \int_A X \, dP = E[X; A]$
on the measurable space $(\Omega, \mathcal{B})$. Given a set $B \in \mathcal{B}$ with $\tilde{P}[B] = 0$, then
$P'[B] = 0$ so that $P'$ is absolutely continuous with respect to $\tilde{P}$. The
Radon-Nikodym theorem (3.1.1) provides us with a random variable $Y \in \mathcal{L}^1(\mathcal{B})$
with $P'[A] = \int_A X \, dP = \int_A Y \, dP$. □


Definition. The random variable $Y$ in this theorem is denoted by $E[X|\mathcal{B}]$
and called the conditional expectation of $X$ with respect to $\mathcal{B}$. The random
variable $Y \in \mathcal{L}^1(\mathcal{B})$ is unique in $L^1(\mathcal{B})$. If $Z$ is a random variable, then
$E[X|Z]$ is defined as $E[X|\sigma(Z)]$. If $\{Z_i\}_{i \in I}$ is a family of random variables,
then $E[X|\{Z_i\}_{i \in I}]$ is defined as $E[X|\sigma(\{Z_i\}_{i \in I})]$.


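Example. If $\mathcal{B}$ is the trivial $\sigma$-algebra $\mathcal{B} = \{\emptyset, \Omega\}$, then $E[X|\mathcal{B}] = E[X]$.
Example. If $\mathcal{B} = \mathcal{A}$, then $E[X|\mathcal{B}] = X$.
Example. If $\mathcal{B} = \{\emptyset, Y, Y^c, \Omega\}$ then

$$ E[X|\mathcal{B}](\omega) = \begin{cases} \frac{1}{P[Y]} \int_Y X \, dP & \text{for } \omega \in Y , \\ \frac{1}{P[Y^c]} \int_{Y^c} X \, dP & \text{for } \omega \in Y^c . \end{cases} $$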
Example. Let $(\Omega, \mathcal{A}, P) = ([0,1] \times [0,1], \mathcal{A}, dx\,dy)$, where $\mathcal{A}$ is the Borel
$\sigma$-algebra defined by the Euclidean distance metric on the square $\Omega$. Let $\mathcal{B}$
be the $\sigma$-algebra of sets $A \times [0,1]$, where $A$ is in the Borel $\sigma$-algebra of the
interval $[0,1]$. If $X(x,y)$ is a random variable on $\Omega$, then $Y = E[X|\mathcal{B}]$ is the
random variable

$$ Y(x,y) = \int_0^1 X(x,y) \, dy . $$

This conditional integral only depends on $x$.


Remark. This notion of conditional expectation will be important later. Here
is a possible interpretation of conditional expectation: for an experiment,
the possible outcomes are modeled by a probability space $(\Omega, \mathcal{A}, P)$ which is
our "laboratory". Assume that the only information about the experiment
are the events in a subalgebra $\mathcal{B}$ of $\mathcal{A}$. It models the "knowledge" obtained
from some measurements we can do in the laboratory and $\mathcal{B}$ is generated by
a set of random variables $\{Z_i\}_{i \in I}$ obtained from some measuring devices.
With respect to these measurements, our best knowledge of the random
variable $X$ is the conditional expectation $E[X|\mathcal{B}]$. It is a random variable
which is a function of the measurements $Z_i$. For a specific "experiment"
$\omega$, the conditional expectation $E[X|\mathcal{B}](\omega)$ is the expected value of $X(\omega)$,
conditioned to the $\sigma$-algebra $\mathcal{B}$ which contains the events singled out by
data from the $Z_i$.


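Proposition 3.1.3. The conditional expectation $X \mapsto E[X|\mathcal{B}]$ is the
projection from $\mathcal{L}^2(\mathcal{A})$ onto $\mathcal{L}^2(\mathcal{B})$.


Proof. The space $\mathcal{L}^2(\mathcal{B})$ of square integrable $\mathcal{B}$-measurable functions is a
linear subspace of $\mathcal{L}^2(\mathcal{A})$. When identifying functions which agree almost
everywhere, then $L^2(\mathcal{B})$ is a Hilbert space which is a linear subspace of the
Hilbert space $L^2(\mathcal{A})$. For any $X \in \mathcal{L}^2(\mathcal{A})$, there exists a unique projection
$p(X) \in \mathcal{L}^2(\mathcal{B})$. The orthogonal complement $\mathcal{L}^2(\mathcal{B})^\perp$ is defined as

$$ \mathcal{L}^2(\mathcal{B})^\perp = \{ Z \in \mathcal{L}^2(\mathcal{A}) \;|\; (Z, Y) = E[Z \cdot Y] = 0 \text{ for all } Y \in \mathcal{L}^2(\mathcal{B}) \} . $$

By the definition of the conditional expectation, we have for $A \in \mathcal{B}$

$$ (X - E[X|\mathcal{B}], 1_A) = E[X - E[X|\mathcal{B}]; A] = 0 . $$

Therefore $X - E[X|\mathcal{B}] \in \mathcal{L}^2(\mathcal{B})^\perp$. Because the map $q(X) = E[X|\mathcal{B}]$ is linear,
satisfies $q^2 = q$ and has the property that $(1 - q)(X)$ is perpendicular
to $\mathcal{L}^2(\mathcal{B})$, the map $q$ is a projection which must agree with $p$. □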


Example. Let $\Omega = \{1, 2, 3, 4\}$ with the uniform probability measure and let
$\mathcal{A}$ be the $\sigma$-algebra of all subsets of $\Omega$. Let
$\mathcal{B} = \{\emptyset, \{1,2\}, \{3,4\}, \Omega\}$. What is the conditional expectation $Y = E[X|\mathcal{B}]$
of the random variable $X(k) = k^2$? The Hilbert space $\mathcal{L}^2(\mathcal{A})$ is the
four-dimensional space $\mathbb{R}^4$ because a random variable $X$ is now just a vector
$X = (X(1), X(2), X(3), X(4)) = (1, 4, 9, 16)$. The Hilbert space $\mathcal{L}^2(\mathcal{B})$ is
the set of all vectors $v = (v_1, v_2, v_3, v_4)$ for which $v_1 = v_2$ and $v_3 = v_4$
because functions which would not be constant in $(v_1, v_2)$ would generate
a finer algebra. It is the two-dimensional subspace of all vectors
$\{ v = (a, a, b, b) \;|\; a, b \in \mathbb{R} \}$. The conditional expectation projects onto that plane.
The first two components $(X(1), X(2))$ project to $(\frac{X(1)+X(2)}{2}, \frac{X(1)+X(2)}{2})$,
the second two components project to $(\frac{X(3)+X(4)}{2}, \frac{X(3)+X(4)}{2})$. Therefore,

$$ E[X|\mathcal{B}] = ( \frac{X(1)+X(2)}{2}, \frac{X(1)+X(2)}{2}, \frac{X(3)+X(4)}{2}, \frac{X(3)+X(4)}{2} ) = ( \frac{5}{2}, \frac{5}{2}, \frac{25}{2}, \frac{25}{2} ) . $$
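
On a finite probability space, the projection can be computed by averaging over the atoms of the sub-$\sigma$-algebra. The following is a minimal sketch, assuming the uniform measure on $\{1,2,3,4\}$ and the atoms $\{1,2\}$ and $\{3,4\}$, which match the averages computed above; the function name `conditional_expectation` is an illustrative choice.

```python
from fractions import Fraction

def conditional_expectation(X, prob, atoms):
    """E[X|B] on a finite space, where B is generated by the partition `atoms`."""
    Y = {}
    for atom in atoms:
        mass = sum(prob[w] for w in atom)
        mean = sum(X[w] * prob[w] for w in atom) / mass
        for w in atom:
            Y[w] = mean          # Y is constant on each atom
    return Y

omega = [1, 2, 3, 4]
prob = {w: Fraction(1, 4) for w in omega}    # assumed uniform measure
X = {w: Fraction(w * w) for w in omega}      # X(k) = k^2
atoms = [{1, 2}, {3, 4}]                     # atoms of B
Y = conditional_expectation(X, prob, atoms)

# Orthogonality of X - Y to the indicator of each atom: E[(X - Y) 1_A] = 0
residuals = [sum((X[w] - Y[w]) * prob[w] for w in atom) for atom in atoms]
```

The vanishing residuals are the finite-dimensional form of the statement that $X - E[X|\mathcal{B}]$ is perpendicular to $\mathcal{L}^2(\mathcal{B})$.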


Remark. Proposition 3.1.3 means that $Y$ is the least-squares best
$\mathcal{B}$-measurable square integrable predictor. This makes conditional expectation
important for controlling processes. If $\mathcal{B}$ is the $\sigma$-algebra describing the
knowledge about a process (like for example the data which a pilot knows
about a plane) and $X$ is the random variable we want to know (which could
be the actual data of the flying plane), then $E[X|\mathcal{B}]$ is the best guess
about this random variable which we can make with our knowledge.


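Exercise. Given two independent random variables $X, Y \in \mathcal{L}^2$ such that
$X$ has the Poisson distribution $P_\lambda$ and $Y$ has the Poisson distribution $P_\mu$.
The random variable $Z = X + Y$ has the Poisson distribution $P_{\lambda+\mu}$ as can
be seen with the help of characteristic functions. Let $\mathcal{B}$ be the $\sigma$-algebra
generated by $Z$. Show that

$$ E[X|\mathcal{B}] = \frac{\lambda}{\lambda + \mu} Z . $$

Hint: It is enough to show

$$ E[X; \{Z = k\}] = \frac{\lambda}{\lambda + \mu} \, k \, P[Z = k] . $$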
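
Before proving the identity $E[X; \{Z = k\}] = \frac{\lambda}{\lambda+\mu} k P[Z = k]$, it can be sanity-checked numerically. The sketch below is not a proof; the parameter values and the helper name `poisson_pmf` are illustrative assumptions.

```python
import math

def poisson_pmf(lam, k):
    """P[N = k] for a Poisson random variable N with parameter lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam, mu = 1.5, 2.5                     # illustrative parameters
gaps = []
for k in range(12):
    # E[X; {Z = k}] = sum_j j P[X = j] P[Y = k - j], using independence
    lhs = sum(j * poisson_pmf(lam, j) * poisson_pmf(mu, k - j) for j in range(k + 1))
    rhs = lam / (lam + mu) * k * poisson_pmf(lam + mu, k)
    gaps.append(abs(lhs - rhs))
```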


Even if random variables are only in $\mathcal{L}^1$, the next list of properties of
conditional expectation can be remembered better with proposition 3.1.3
in mind, which identifies conditional expectation as a projection if they are
in $\mathcal{L}^2$.
Theorem 3.1.4 (Properties of conditional expectation). For given random
variables $X, X_n, Y \in \mathcal{L}$, the following properties hold:

(1) Linearity: The map $X \mapsto E[X|\mathcal{B}]$ is linear.

(2) Positivity: $X \geq 0 \Rightarrow E[X|\mathcal{B}] \geq 0$.

(3) Tower property: $\mathcal{C} \subset \mathcal{B} \subset \mathcal{A} \Rightarrow E[E[X|\mathcal{B}]|\mathcal{C}] = E[X|\mathcal{C}]$.

(4) Conditional Fatou: $|X_n| \leq X \Rightarrow E[\liminf_{n \to \infty} X_n|\mathcal{B}] \leq \liminf_{n \to \infty} E[X_n|\mathcal{B}]$.

(5) Conditional dominated convergence: $|X_n| \leq X$, $X_n \to X$ a.e.
$\Rightarrow E[X_n|\mathcal{B}] \to E[X|\mathcal{B}]$ a.e.

(6) Conditional Jensen: if $h$ is convex, then $E[h(X)|\mathcal{B}] \geq h(E[X|\mathcal{B}])$.
Especially $||E[X|\mathcal{B}]||_p \leq ||X||_p$.

(7) Extracting knowledge: For $Z \in \mathcal{L}^\infty(\mathcal{B})$, one has $E[ZX|\mathcal{B}] = Z E[X|\mathcal{B}]$.

(8) Independence: if $X$ is independent of $\mathcal{C}$, then $E[X|\mathcal{C}] = E[X]$.


Proof. (1) Linearity follows from the linearity of the integral and the
uniqueness of the conditional expectation.

(2) For positivity, note that if $Y = E[X|\mathcal{B}]$ would be negative on a
set of positive measure, then $A = Y^{-1}((-\infty, -1/n]) \in \mathcal{B}$ would have positive
probability for some $n$. This would lead to the contradiction $0 \leq E[1_A X] =
E[1_A Y] \leq -n^{-1} P[A] < 0$.

(3) Use that $P'' \ll P' \ll P$ implies $P'' = Y'P' = Y'YP$ and $P'' \ll P$ gives
$P'' = ZP$ so that $Z = Y'Y$ almost everywhere.
This is especially useful when applied to the algebra $\mathcal{C}_Y = \{\emptyset, Y, Y^c, \Omega\}$,
because $X \leq Y$ almost everywhere if and only if $E[X|\mathcal{C}_Y] \leq E[Y|\mathcal{C}_Y]$ for
all $Y \in \mathcal{B}$.

(4)-(5) The conditional versions of the Fatou lemma or the dominated
convergence theorem are true, if they are true conditioned with $\mathcal{C}_Y$ for
each $Y \in \mathcal{B}$. The tower property reduces these statements to versions with
$\mathcal{B} = \mathcal{C}_Y$ which are then on each of the sets $Y, Y^c$ the usual theorems.

(6) Choose a sequence $(a_n, b_n) \in \mathbb{R}^2$ such that $h(x) = \sup_n (a_n x + b_n)$ for all
$x \in \mathbb{R}$. We get from $h(X) \geq a_n X + b_n$ that almost surely $E[h(X)|\mathcal{B}] \geq
a_n E[X|\mathcal{B}] + b_n$. These inequalities hold therefore simultaneously for all $n$
and we obtain almost surely

$$ E[h(X)|\mathcal{B}] \geq \sup_n ( a_n E[X|\mathcal{B}] + b_n ) = h(E[X|\mathcal{B}]) . $$

The corollary is obtained with $h(x) = |x|^p$.

(7) It is enough to condition it to each algebra $\mathcal{C}_Y$ for $Y \in \mathcal{B}$. The tower
property reduces these statements to linearity.

(8) By linearity, we can assume $X \geq 0$. For $B \in \mathcal{B}$ and $C \in \mathcal{C}$, the random
variables $X 1_B$ and $1_C$ are independent so that

$$ E[X 1_{B \cap C}] = E[X 1_B 1_C] = E[X 1_B] P[C] . $$

The random variable $Y = E[X|\mathcal{B}]$ is $\mathcal{B}$-measurable and because $Y 1_B$ is
independent of $\mathcal{C}$ we get

$$ E[(Y 1_B) 1_C] = E[Y 1_B] P[C] $$

so that $E[1_{B \cap C} X] = E[1_{B \cap C} Y]$. The measures on $\sigma(\mathcal{B}, \mathcal{C})$

$$ \mu : A \mapsto E[1_A X] , \qquad \nu : A \mapsto E[1_A Y] $$

agree therefore on the $\pi$-system of sets of the form $B \cap C$ with $B \in \mathcal{B}$ and $C \in \mathcal{C}$
and consequently everywhere on $\sigma(\mathcal{B}, \mathcal{C})$. □


Remark. From the conditional Jensen property in theorem (3.1.4), it follows
that the operation of conditional expectation is a positive and continuous
operation on $\mathcal{L}^p$ for any $p \geq 1$.


Remark. The properties of conditional Fatou, Lebesgue and Jensen are
statements about functions in $\mathcal{L}^1(\mathcal{B})$ and not about numbers as the usual
theorems of Fatou, Lebesgue or Jensen.


Remark. Is there for almost all $\omega \in \Omega$ a probability measure $P_\omega$ such that

$$ E[X|\mathcal{B}](\omega) = \int_\Omega X \, dP_\omega \; ? $$

If such a map from $\Omega$ to $M_1(\Omega)$ exists and if it is $\mathcal{B}$-measurable, it is called
a regular conditional probability given $\mathcal{B}$. In general such a map $\omega \mapsto P_\omega$
does not exist. However, it is known that for a probability space $(\Omega, \mathcal{A}, P)$
for which $\Omega$ is a complete separable metric space with Borel $\sigma$-algebra $\mathcal{A}$,
there exists a regular conditional probability for any sub $\sigma$-algebra $\mathcal{B}$ of $\mathcal{A}$.


Exercise. This exercise deals with conditional expectation.

a) What is $E[Y|Y]$?

b) Show that if $E[X|\mathcal{A}] = 0$ and $E[X|\mathcal{B}] = 0$, then $E[X|\sigma(\mathcal{A}, \mathcal{B})] = 0$.

c) Given $X, Y \in \mathcal{L}^1$ satisfying $E[X|Y] = Y$ and $E[Y|X] = X$. Verify that
$X = Y$ almost everywhere.


We add a notation which is commonly used. 


Definition. The conditional probability space $(\Omega, \mathcal{A}, P[\cdot|\mathcal{B}])$ is defined by

$$ P[A \,|\, \mathcal{B}] = E[1_A|\mathcal{B}] . $$

For $X \in \mathcal{L}^p$, one has the conditional moment $E[X^p|\mathcal{B}]$ if $\mathcal{B}$ is a
$\sigma$-subalgebra of $\mathcal{A}$. These are $\mathcal{B}$-measurable random variables and generalize
the usual moments. Of special interest is the conditional variance:


Definition. For $X \in \mathcal{L}^2$, the conditional variance $Var[X|\mathcal{B}]$ is the random
variable $E[X^2|\mathcal{B}] - E[X|\mathcal{B}]^2$. Especially, if $\mathcal{B}$ is generated by a random
variable $Y$, one writes $Var[X|Y] = E[X^2|Y] - E[X|Y]^2$.
Remark. Because conditional expectation is a projection, all properties
known for the usual variance hold for the more general notion of conditional
variance. For example, if $X, Z$ are independent random variables in $\mathcal{L}^2$,
then $Var[X + Z|Y] = Var[X|Y] + Var[Z|Y]$. One also has the identity
$Var[X|Y] = E[(X - E[X|Y])^2|Y]$.


Lemma 3.1.5. (Law of total variance) For $X \in \mathcal{L}^2$ and an arbitrary random
variable $Y$, one has

$$ Var[X] = E[Var[X|Y]] + Var[E[X|Y]] . $$


Proof. By the definition of the conditional variance as well as the properties
of conditional expectation:

$$ \begin{aligned}
Var[X] &= E[X^2] - E[X]^2 \\
 &= E[E[X^2|Y]] - E[E[X|Y]]^2 \\
 &= E[Var[X|Y]] + E[E[X|Y]^2] - E[E[X|Y]]^2 \\
 &= E[Var[X|Y]] + Var[E[X|Y]] .
\end{aligned} $$
□
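
The law of total variance can be verified exactly on a small joint distribution. The following sketch uses a made-up joint law of $(X, Y)$ on four points, chosen only for illustration, and exact rational arithmetic.

```python
from fractions import Fraction

# Illustrative joint law of (X, Y): {(x, y): probability}; the values are assumptions.
joint = {(0, 0): Fraction(1, 4), (1, 0): Fraction(1, 4),
         (1, 1): Fraction(1, 6), (4, 1): Fraction(1, 3)}

def mean(f):
    """E[f(X, Y)] under the joint law."""
    return sum(p * f(x, y) for (x, y), p in joint.items())

var_X = mean(lambda x, y: x * x) - mean(lambda x, y: x) ** 2

p_Y, cond_mean, cond_var = {}, {}, {}
for y0 in {y for (_, y) in joint}:
    p = sum(q for (x, y), q in joint.items() if y == y0)
    m1 = sum(q * x for (x, y), q in joint.items() if y == y0) / p      # E[X | Y = y0]
    m2 = sum(q * x * x for (x, y), q in joint.items() if y == y0) / p  # E[X^2 | Y = y0]
    p_Y[y0], cond_mean[y0], cond_var[y0] = p, m1, m2 - m1 ** 2

expected_cond_var = sum(p_Y[y] * cond_var[y] for y in p_Y)             # E[Var[X|Y]]
mean_of_cond_mean = sum(p_Y[y] * cond_mean[y] for y in p_Y)
var_of_cond_mean = (sum(p_Y[y] * cond_mean[y] ** 2 for y in p_Y)
                    - mean_of_cond_mean ** 2)                          # Var[E[X|Y]]
```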


Here is an application which illustrates how one can use the conditional
variance: the Cantor distribution is the singular continuous distribution
whose law $\mu$ has its support on the standard Cantor set.


Corollary 3.1.6. (Variance of the Cantor distribution) The standard Cantor 
distribution for the Cantor set on [0,1] has the expectation 1/2 and the 
variance 1/8. 


Proof. Let $X$ be a random variable with the Cantor distribution. By symmetry,
$E[X] = \int_0^1 x \, d\mu(x) = 1/2$. Define the $\sigma$-algebra

$$ \{ \emptyset, [0, 1/3), [1/3, 1], [0, 1] \} $$

on $\Omega = [0,1]$. It is generated by the random variable $Y = 1_{[0,1/3)}$. Define
$Z = E[X|Y]$. It is a random variable which is constant 1/6 on $[0, 1/3)$
and equal to 5/6 on $[1/3, 1]$. It has the expectation $E[Z] = (1/6) P[Y = 1]
+ (5/6) P[Y = 0] = 1/12 + 5/12 = 1/2$ and the variance

$$ Var[Z] = E[Z^2] - E[Z]^2 = \frac{1}{36} P[Y = 1] + \frac{25}{36} P[Y = 0] - 1/4 = 1/9 . $$

Define the random variable $W = Var[X|Y] = E[X^2|Y] - E[X|Y]^2 =
E[X^2|Y] - Z^2$. It is equal to $\frac{1}{\mu([0,1/3])} \int_0^{1/3} (x - 1/6)^2 \, d\mu(x)$ on $[0, 1/3]$ and equal to
$\frac{1}{\mu([2/3,1])} \int_{2/3}^{1} (x - 5/6)^2 \, d\mu(x)$ on $[2/3, 1]$. By the self-similarity of the Cantor set, we
see that $W = Var[X|Y]$ is actually constant and equal to $Var[X]/9$. The
identity $E[Var[X|Y]] = Var[X]/9$ implies

$$ Var[X] = E[Var[X|Y]] + Var[E[X|Y]] = E[W] + Var[Z] = \frac{Var[X]}{9} + \frac{1}{9} . $$

Solving for $Var[X]$ gives $Var[X] = 1/8$. □
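
The value 1/8 can be cross-checked in two ways with exact arithmetic. The sketch below solves the fixed-point equation $v = v/9 + 1/9$ from the proof and compares it with partial sums of $\sum_n 9^{-n}$, which is the variance obtained from the series representation $X = \sum_n X_n/3^n$ of section 2.17 with independent $X_n \in \{0, 2\}$ of variance 1; the cutoff of 29 terms is an arbitrary illustrative choice.

```python
from fractions import Fraction

# Fixed-point equation from the proof: v = v/9 + 1/9
v_fixed_point = Fraction(1, 9) / (1 - Fraction(1, 9))

# Independent terms X_n / 3^n with Var[X_n] = 1 give Var[X] = sum_n 9^(-n).
partial = sum(Fraction(1, 9 ** n) for n in range(1, 30))
gap = v_fixed_point - partial        # tail of the geometric series
```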


Exercise. Given a probability space $(\Omega, \mathcal{A}, P)$ and a $\sigma$-algebra $\mathcal{B} \subset \mathcal{A}$.

a) Show that the map $P : X \in \mathcal{L}^1 \mapsto E[X|\mathcal{B}]$ is a Markov operator from
$\mathcal{L}^1(\mathcal{A}, P)$ to $\mathcal{L}^1(\mathcal{B}, Q)$, where $Q$ is the conditional probability measure on
$(\Omega, \mathcal{B})$ defined by $Q[A] = P[A]$ for $A \in \mathcal{B}$.

b) The map $T$ can also be viewed as a map on the new probability space
$(\Omega, \mathcal{B}, Q)$, where $Q$ is the conditional probability. Denote this new map by
$S$. Show that $S$ is again measure preserving and invertible.


Exercise. a) Given a measure preserving invertible map $T : \Omega \to \Omega$ we call
$(\Omega, T, \mathcal{A}, P)$ a dynamical system. A complex number $\lambda$ is called an
eigenvalue of $T$, if there exists $X \in \mathcal{L}^2$ such that $X(T) = \lambda X$. The map $T$ is said
to have pure point spectrum, if there exists a countable set of eigenvalues
$\lambda_i$ such that their eigenfunctions $X_i$ span $\mathcal{L}^2$. Show that if $T$ has pure point
spectrum, then also $S$ has pure point spectrum.

b) A measure preserving dynamical system $(\Delta, S, \mathcal{B}, \nu)$ is called a factor of a
measure preserving dynamical system $(\Omega, T, \mathcal{A}, \mu)$ if there exists a measure
preserving map $U : \Omega \to \Delta$ such that $S \circ U(x) = U \circ T(x)$ for all $x \in \Omega$.
Examples of factors are the system itself or the trivial system $(\Omega, S(x) = x, \mu)$.
If $S$ is a factor of $T$ and $T$ is a factor of $S$, then the two systems are called
isomorphic. Verify that every factor of a dynamical system $(\Omega, T, \mathcal{A}, \mu)$ can
be realized as $(\Omega, T, \mathcal{B}, \mu)$ where $\mathcal{B}$ is a $\sigma$-subalgebra of $\mathcal{A}$.

c) It is known that if a measure preserving transformation $T$ on a
probability space has pure point spectrum, then the system is isomorphic to a
translation on the compact Abelian group $\hat{G}$ which is the dual group of the
discrete group $G$ formed by the spectrum $\sigma(T) \subset \mathbb{T}$. Describe the possible
factors of $T$ and their spectra.


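Exercise. Let $\Omega = \mathbb{T}^1$ be the one-dimensional circle. Let $\mathcal{A}$ be the Borel
$\sigma$-algebra on $\mathbb{T}^1 = \mathbb{R}/(2\pi\mathbb{Z})$ and $P = dx$ the Lebesgue measure. Given $k \in \mathbb{N}$,
denote by $\mathcal{B}_k$ the $\sigma$-algebra consisting of all $A \in \mathcal{A}$ such that $A + \frac{2\pi n}{k} =
A \pmod{2\pi}$ for all $1 \leq n \leq k$. What is the conditional expectation $E[X|\mathcal{B}_k]$
for a random variable $X \in \mathcal{L}^1$?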
3.2 Martingales


It is typical in probability theory that one considers several $\sigma$-algebras on
a probability space $(\Omega, \mathcal{A}, P)$. These algebras are often defined by a set of
random variables, especially in the case of stochastic processes. Martingales
are discrete stochastic processes which generalize the process of summing
up IID random variables. They are a powerful tool with many applications.


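Definition. A sequence $\{\mathcal{A}_n\}_{n \in \mathbb{N}}$ of sub $\sigma$-algebras of $\mathcal{A}$ is called a
filtration, if $\mathcal{A}_0 \subset \mathcal{A}_1 \subset \cdots \subset \mathcal{A}$. Given a filtration $\{\mathcal{A}_n\}_{n \in \mathbb{N}}$, one calls
$(\Omega, \mathcal{A}, \{\mathcal{A}_n\}_{n \in \mathbb{N}}, P)$ a filtered space.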


Example. If $\Omega = \{0,1\}^{\mathbb{N}}$ is the space of all 0-1 sequences with the Borel
$\sigma$-algebra generated by the product topology and $\mathcal{A}_n$ is the $\sigma$-algebra
generated by the finite set of cylinder sets $A = \{x_1 = a_1, \dots, x_n = a_n\}$ with
$a_i \in \{0,1\}$, which contains $2^n$ elements, then $\{\mathcal{A}_n\}_{n \in \mathbb{N}}$ is a filtration.


Definition. A sequence $X = \{X_n\}_{n \in \mathbb{N}}$ of random variables is called a
discrete stochastic process or simply process. It is an $\mathcal{L}^p$-process, if each $X_n$
is in $\mathcal{L}^p$. A process is called adapted to the filtration $\{\mathcal{A}_n\}$ if $X_n$ is
$\mathcal{A}_n$-measurable for all $n \in \mathbb{N}$.


Example. For $\Omega = \{0,1\}^{\mathbb{N}}$ as above, the process $X_n(x) = \prod_{i=1}^n x_i$ is
a stochastic process adapted to the filtration. Also $S_n(x) = \sum_{i=1}^n x_i$ is
adapted to the filtration.


Definition. An $\mathcal{L}^1$-process which is adapted to a filtration $\{\mathcal{A}_n\}$ is called a
martingale if

$$ E[X_n|\mathcal{A}_{n-1}] = X_{n-1} $$

for all $n \geq 1$. It is called a supermartingale if $E[X_n|\mathcal{A}_{n-1}] \leq X_{n-1}$ and a
submartingale if $E[X_n|\mathcal{A}_{n-1}] \geq X_{n-1}$. If we mean either submartingale or
supermartingale (or martingale) we speak of a semimartingale.


Remark. It immediately follows that for a martingale

$$ E[X_n|\mathcal{A}_m] = X_m $$

if $m < n$ and that $E[X_n]$ is constant. Allan Gut mentions in [34] that a
martingale is an allegory for "life" itself: the expected state of the future
given the past history is equal to the present state and on average, nothing
happens.
Figure. A random variable $X$ on the unit square defines a gray scale picture
if we interpret $X(x,y)$ as the gray value at the point $(x,y)$. It shows Joseph
Leo Doob (1910-2004), who developed basic martingale theory and many
applications. The partitions $\mathcal{A}_n = \{ [k/2^n, (k+1)/2^n) \times [j/2^n, (j+1)/2^n) \}$
define a filtration of $\Omega = [0,1] \times [0,1]$. The sequence of pictures shows the
conditional expectations $E[X|\mathcal{A}_n]$. It is a martingale.


Exercise. Determine from the following sequence of pictures whether it is
a supermartingale or a submartingale. The images get brighter and brighter
on average as the resolution becomes better.




Definition. If a martingale $X_n$ is given with respect to the filtration
$\mathcal{A}_n = \sigma(Y_0, \dots, Y_n)$, where $Y_n$ is a given process, then $X$ is
called a martingale with respect to $Y$.


Remark. The word “martingale” means a gambling system in which losing 
bets are doubled. It is also the name of a part of a horse’s harness or a belt 
on the back of a man’s coat. 


Remark. If $X$ is a supermartingale, then $-X$ is a submartingale and vice
versa. A supermartingale which is also a submartingale is a martingale.
Since we can change $X$ to $X - X_0$ without destroying any of the martingale
properties, we could assume the process is null at 0, which means $X_0 = 0$.


Exercise. a) Verify that if $X_n, Y_n$ are two submartingales, then $\sup(X, Y)$
is a submartingale.

b) If $X_n$ is a submartingale, then $E[X_n] \geq E[X_{n-1}]$.

c) If $X_n$ is a martingale, then $E[X_n] = E[X_{n-1}]$.


Remark. Given a martingale, the tower property of conditional expectation
implies that for $m < n$


$E[X_n|\mathcal{A}_m] = E[E[X_n|\mathcal{A}_{n-1}]|\mathcal{A}_m] = E[X_{n-1}|\mathcal{A}_m] = \cdots = X_m$ .


Example. Sum of independent random variables

Let $X_i \in \mathcal{L}^1$ be a sequence of independent random variables with mean
$E[X_i] = 0$. Define $S_0 = 0$, $S_n = \sum_{k=1}^n X_k$ and
$\mathcal{A}_n = \sigma(X_1, \dots, X_n)$ with $\mathcal{A}_0 = \{\emptyset, \Omega\}$. Then $S_n$
is a martingale since $S_n$ is an $\{\mathcal{A}_n\}$-adapted $\mathcal{L}^1$-process and


$E[S_n|\mathcal{A}_{n-1}] = E[S_{n-1}|\mathcal{A}_{n-1}] + E[X_n|\mathcal{A}_{n-1}] = S_{n-1} + E[X_n] = S_{n-1}$ .


We have used linearity and the independence property of the conditional
expectation.
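

Remark. The martingale property of a sum of centered independent steps can
be checked mechanically on a finite example. The following Python sketch
enumerates every atom of $\mathcal{A}_{n-1}$ (every prefix of steps) and compares
$E[S_n|\mathcal{A}_{n-1}]$ with $S_{n-1}$ exactly. The step sets (-1, 1) and (-1, 2),
the horizon 6 and all function names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

def conditional_mean_of_next_sum(prefix, step_values):
    """E[S_n | X_1..X_{n-1} = prefix] for IID steps uniform on step_values."""
    s_prev = sum(prefix)
    return sum(Fraction(s_prev + x) for x in step_values) / len(step_values)

def is_martingale(step_values, n_max):
    """Check E[S_n | A_{n-1}] = S_{n-1} on every atom of A_{n-1}, n <= n_max."""
    for n in range(1, n_max + 1):
        for prefix in product(step_values, repeat=n - 1):
            if conditional_mean_of_next_sum(prefix, step_values) != sum(prefix):
                return False
    return True

fair = is_martingale((-1, 1), 6)     # centered steps: mean 0
biased = is_martingale((-1, 2), 6)   # steps with mean 1/2: a submartingale only
print(fair, biased)
```

With centered steps the check succeeds, while steps of mean 1/2 fail it, in
agreement with the computation above.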


Example. Conditional expectation

Given a random variable $X \in \mathcal{L}^1$ on a filtered space
$(\Omega, \mathcal{A}, \{\mathcal{A}_n\}_{n \in \mathbb{N}}, P)$. Then $X_n = E[X|\mathcal{A}_n]$ is a
martingale.

Especially: given a sequence $Y_n$ of random variables. Then
$\mathcal{A}_n = \sigma(Y_0, \dots, Y_n)$ is a filtration and
$X_n = E[X|Y_0, \dots, Y_n]$ is a martingale. Proof: by the tower property


$E[X_n|\mathcal{A}_{n-1}] = E[X_n|Y_0, \dots, Y_{n-1}] = E[E[X|Y_0, \dots, Y_n]|Y_0, \dots, Y_{n-1}]$
$= E[X|Y_0, \dots, Y_{n-1}] = X_{n-1}$ .


We say $X$ is a martingale with respect to $Y$. Note that because $X_n$ is by
definition $\sigma(Y_0, \dots, Y_n)$-measurable, there exist Borel measurable
functions $h_n : \mathbb{R}^{n+1} \to \mathbb{R}$ such that $X_n = h_n(Y_0, \dots, Y_n)$.


Example. Product of positive variables

Given a sequence $Y_n$ of independent random variables $Y_n \geq 0$ satisfying
$E[Y_n] = 1$. Define $X_0 = 1$, $X_n = \prod_{i=1}^n Y_i$ and
$\mathcal{A}_n = \sigma(Y_1, \dots, Y_n)$. Then $X_n$ is a martingale. This is an
exercise. Note that the martingale property does not follow directly by
taking logarithms.


Example. Product of matrix-valued random variables

Given a sequence of independent random variables $Z_n$ with values in the
group $GL(N,\mathbb{R})$ of invertible $N \times N$ matrices and let
$\mathcal{A}_n = \sigma(Z_1, \dots, Z_n)$. Assume $E[\log ||Z_n||] \leq 0$, where $||Z_n||$
denotes the norm of the matrix (the square root of the maximal eigenvalue
of $Z_n \cdot Z_n^*$, where $Z_n^*$ is the adjoint). Define the real-valued random
variables $X_n = \log ||Z_1 \cdot Z_2 \cdots Z_n||$, where $\cdot$ denotes matrix
multiplication. Because $X_n \leq \log ||Z_n|| + X_{n-1}$, we get


$E[X_n|\mathcal{A}_{n-1}] \leq E[\log ||Z_n|| \,|\, \mathcal{A}_{n-1}] + E[X_{n-1}|\mathcal{A}_{n-1}]$
$= E[\log ||Z_n||] + X_{n-1} \leq X_{n-1}$


so that $X_n$ is a supermartingale. In ergodic theory, such a process $X_n$
built from matrix products is called sub-additive.


Example. If $Z_n$ is a sequence of matrix-valued random variables, we can
also look at the sequence of random variables $Y_n = ||Z_1 \cdot Z_2 \cdots Z_n||$. If
$E[||Z_n||] = 1$, then $Y_n$ is a supermartingale.


Example. Polya's urn scheme

An urn contains initially a red and a black ball. At each time $n \geq 1$, a
ball is taken randomly, its color noted, and both this ball and another
ball of the same color are placed back into the urn. Like this, after $n$
draws, the urn contains $n + 2$ balls. Define $Y_n$ as the number of black balls
after $n$ moves and $X_n = Y_n/(n + 2)$, the fraction of black balls. We claim
that $X$ is a martingale with respect to $Y$: the random variables $Y_n$ take
values in $\{1, \dots, n + 1\}$. Clearly $P[Y_{n+1} = k + 1 | Y_n = k] = k/(n + 2)$ and
$P[Y_{n+1} = k | Y_n = k] = 1 - k/(n + 2)$. Therefore


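$E[X_{n+1}|Y_1, \dots, Y_n] = \frac{1}{n+3} E[Y_{n+1}|Y_1, \dots, Y_n]$

$= \frac{1}{n+3} \left[ (Y_n + 1) P[Y_{n+1} = Y_n + 1 | Y_n] + Y_n P[Y_{n+1} = Y_n | Y_n] \right]$

$= \frac{1}{n+3} \left[ (Y_n + 1) \frac{Y_n}{n+2} + Y_n \left(1 - \frac{Y_n}{n+2}\right) \right]$

$= \frac{Y_n}{n+2} = X_n$ .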


Note that $X_n$ is not independent of $X_{n-1}$. The process "learns" in the sense
that if there are more black balls, then the winning chances are better.
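

Remark. The martingale property of the urn fraction can be verified with
exact rational arithmetic. The following Python sketch computes the law of
$Y_n$ by dynamic programming and checks that $E[X_{n+1}|Y_n = k] = k/(n+2)$ and
that $E[X_n] = 1/2$. The function names and the choice $n = 10$ are assumptions
made for illustration.

```python
from fractions import Fraction

def polya_distribution(n):
    """Exact law of Y_n (black balls after n draws) as {k: P[Y_n = k]}."""
    law = {1: Fraction(1)}                 # Y_0 = 1 black ball out of 2
    for step in range(n):
        total = step + 2                   # balls in the urn before this draw
        nxt = {}
        for k, prob in law.items():
            p_black = Fraction(k, total)   # chance to draw a black ball
            nxt[k + 1] = nxt.get(k + 1, 0) + prob * p_black
            nxt[k] = nxt.get(k, 0) + prob * (1 - p_black)
        law = nxt
    return law

def conditional_mean_fraction(n, k):
    """E[X_{n+1} | Y_n = k] with X_n = Y_n / (n + 2)."""
    p_black = Fraction(k, n + 2)
    return (p_black * (k + 1) + (1 - p_black) * k) / (n + 3)

n = 10
law = polya_distribution(n)
mean_fraction = sum(p * Fraction(k, n + 2) for k, p in law.items())
print(mean_fraction)
```

The expected fraction of black balls stays at its initial value, as it must
for a martingale.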


Figure. A typical run of 30 experiments with Polya's urn.


Example. Branching processes

Let $Z_{ni}$ be IID, integer-valued random variables with positive finite mean
$m$. Define $Y_0 = 1$ and


$Y_{n+1} = \sum_{k=1}^{Y_n} Z_{nk}$


with the convention that for $Y_n = 0$, the sum is zero. We claim that
$X_n = Y_n/m^n$ is a martingale with respect to $Y$. By the independence of $Y_n$
and $Z_{ni}$, $i \geq 1$, we have for every $n$


$E[Y_{n+1}|Y_0, \dots, Y_n] = E[\sum_{k=1}^{Y_n} Z_{nk} | Y_0, \dots, Y_n] = m Y_n$


so that


$E[X_{n+1}|Y_0, \dots, Y_n] = E[Y_{n+1}|Y_0, \dots, Y_n]/m^{n+1} = m Y_n/m^{n+1} = X_n$ .


The branching process can be used to model population growth, disease
epidemics or nuclear reactions. In the first case, think of $Y_n$ as the size
of a population at time $n$ and of $Z_{ni}$ as the number of progenies of the
$i$-th member of the population in the $n$-th generation.
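

Remark. For a small offspring law, the distribution of $Y_n$ can be computed
exactly, which allows us to confirm $E[X_n] = E[Y_n]/m^n = 1$ for the first few
generations. The following Python sketch does this with rational
arithmetic. The offspring distribution on {0, 1, 2} is a hypothetical
choice, and the helper names are assumptions.

```python
from fractions import Fraction

def convolve(p, q):
    """Law of the sum of two independent variables with laws p and q."""
    out = {}
    for i, a in p.items():
        for j, b in q.items():
            out[i + j] = out.get(i + j, 0) + a * b
    return out

def next_generation(law, offspring):
    """Law of Y_{n+1} = Z_1 + ... + Z_{Y_n}, given the law of Y_n."""
    out = {}
    power = {0: Fraction(1)}          # law of an empty sum (j = 0 parents)
    for j in range(max(law) + 1):
        weight = law.get(j, 0)        # P[Y_n = j]
        if weight:
            for k, prob in power.items():
                out[k] = out.get(k, 0) + weight * prob
        power = convolve(power, offspring)   # now the law of j + 1 offspring sums
    return out

# Hypothetical offspring law with mean m = 5/4.
offspring = {0: Fraction(1, 4), 1: Fraction(1, 4), 2: Fraction(1, 2)}
m = sum(k * p for k, p in offspring.items())

law = {1: Fraction(1)}                # Y_0 = 1
means_of_X = []
for n in range(1, 5):
    law = next_generation(law, offspring)
    mean_Y = sum(k * p for k, p in law.items())
    means_of_X.append(mean_Y / m**n)  # E[X_n] with X_n = Y_n / m^n
print(means_of_X)
```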


Figure. A typical growth of $Y_n$ of a branching process. In this example,
the random variables $Z_{ni}$ had a Poisson distribution with mean $m = 1.1$.
It is possible that the process dies out, but often, it grows
exponentially.


Proposition 3.2.1. Let $\mathcal{A}_n$ be a fixed filtered sequence of $\sigma$-algebras.
Linear combinations of martingales over $\mathcal{A}_n$ are again martingales over
$\mathcal{A}_n$. Submartingales and supermartingales form cones: if for example
$X, Y$ are submartingales and $a, b \geq 0$, then $aX + bY$ is a submartingale.


Proof. Use the linearity and positivity of the conditional expectation. □


Proposition 3.2.2. a) If $X$ is a martingale and $u$ is convex such that
$u(X_n) \in \mathcal{L}^1$, then $Y = u(X)$ is a submartingale. Especially, if $X$ is
a martingale, then $|X|$ is a submartingale.

b) If $u$ is monotone and convex and $X$ is a submartingale such that
$u(X_n) \in \mathcal{L}^1$, then $u(X)$ is a submartingale.


Proof. a) We have by the conditional Jensen property (3.1.4)

$Y_n = u(X_n) = u(E[X_{n+1}|\mathcal{A}_n]) \leq E[u(X_{n+1})|\mathcal{A}_n] = E[Y_{n+1}|\mathcal{A}_n]$ .


b) Use the conditional Jensen property again and the monotonicity of $u$ to
get


$Y_n = u(X_n) \leq u(E[X_{n+1}|\mathcal{A}_n]) \leq E[u(X_{n+1})|\mathcal{A}_n] = E[Y_{n+1}|\mathcal{A}_n]$ .
□


Definition. A stochastic process $C = \{C_n\}_{n \geq 1}$ is called previsible if
$C_n$ is $\mathcal{A}_{n-1}$-measurable. A process $X$ is called bounded if
$X_n \in \mathcal{L}^\infty$ and if there exists $K \in \mathbb{R}$ such that
$||X_n||_\infty \leq K$ for all $n \in \mathbb{N}$.


Definition. Given a semimartingale $X$ and a previsible process $C$, the
process


$(\int C \, dX)_n = \sum_{k=1}^n C_k (X_k - X_{k-1})$


is called a discrete stochastic integral or a martingale transform.


Theorem 3.2.3 (The system can't be beaten). If $C$ is a bounded nonnegative
previsible process and $X$ is a supermartingale, then $\int C \, dX$ is a
supermartingale. The same statement is true for submartingales and
martingales.


Proof. Let $Y = \int C \, dX$. From the property of "extracting knowledge" in
theorem (3.1.4), we get


$E[Y_n - Y_{n-1}|\mathcal{A}_{n-1}] = E[C_n (X_n - X_{n-1})|\mathcal{A}_{n-1}] = C_n \cdot E[X_n - X_{n-1}|\mathcal{A}_{n-1}] \leq 0$

because $C_n$ is nonnegative and $X_n$ is a supermartingale. □


Remark. If one wants to relax the boundedness of $C$, then one has to
strengthen the condition for $X$. The proposition stays true if both $C$ and
$X$ are $\mathcal{L}^2$-processes.


Remark. Here is an interpretation: if $X_n$ represents your capital in a
game, then $X_n - X_{n-1}$ are the net winnings per unit stake. If $C_n$ is the
stake on game $n$, then


$\int C \, dX = \sum_{k=1}^n C_k (X_k - X_{k-1})$


are the total winnings up to time $n$. A martingale represents a fair game
since $E[X_n - X_{n-1}|\mathcal{A}_{n-1}] = 0$, whereas a supermartingale is a game
which is unfavorable to you. The above proposition tells that you cannot
find a strategy for putting your stake to make the game fair.


Figure. In this example, the increments $X_n - X_{n-1}$ are $\pm 1$ with
probability 1/2 and $C_n = 1$ if $X_{n-1}$ is even and $C_n = 0$ if $X_{n-1}$ is odd.
The original process $X_n$ is a symmetric random walk and so a martingale.
The new process $\int C \, dX$ is again a martingale.
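

Remark. The statement that a previsible betting scheme cannot change the
expected winnings of a fair game can be tested by brute force. The
following Python sketch averages $(\int C \, dX)_n$ over all $2^n$ equally likely
paths of the symmetric random walk for two staking rules: the parity rule
from the figure and a doubling rule. Both rule implementations and the
horizon $n = 8$ are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

def martingale_transform(path, stake):
    """(int C dX)_n = sum_k C_k (X_k - X_{k-1}); stake(k, history) sees X_0..X_{k-1}."""
    total = 0
    for k in range(1, len(path)):
        total += stake(k, path[:k]) * (path[k] - path[k - 1])
    return total

def expected_winnings(stake, n):
    """Average of (int C dX)_n over all 2^n equally likely +-1 paths."""
    total = Fraction(0)
    for steps in product((-1, 1), repeat=n):
        path = [0]
        for s in steps:
            path.append(path[-1] + s)
        total += martingale_transform(path, stake)
    return total / 2**n

def parity_stake(k, history):
    """Bet one unit only when X_{k-1} is even."""
    return 1 if history[-1] % 2 == 0 else 0

def double_after_loss(k, history):
    """Stake 2^j after j consecutive losing steps (bounded on a finite horizon)."""
    losses = 0
    for i in range(len(history) - 1, 0, -1):
        if history[i] < history[i - 1]:
            losses += 1
        else:
            break
    return 2**losses

print(expected_winnings(parity_stake, 8), expected_winnings(double_after_loss, 8))
```

Both stakes depend only on the history before the current game, which is
what previsibility requires.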


Exercise. a) Let $Y_1, Y_2, \dots$ be a sequence of independent non-negative
random variables satisfying $E[Y_k] = 1$ for all $k \in \mathbb{N}$. Define $X_0 = 1$,
$X_n = Y_1 \cdots Y_n$ and $\mathcal{A}_n = \sigma(Y_1, Y_2, \dots, Y_n)$. Show that $X_n$ is a
martingale.

b) Let $Z_n$ be a sequence of independent random variables taking values in
the set of $N \times N$ matrices satisfying $E[||Z_n||] = 1$. Define $X_0 = 1$,
$X_n = ||Z_1 \cdots Z_n||$. Show that $X_n$ is a supermartingale.


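Definition. A random variable $T$ with values in
$\overline{\mathbb{N}} = \mathbb{N} \cup \{\infty\}$ is called a random time. Define
$\mathcal{A}_\infty = \sigma(\bigcup_{n \geq 0} \mathcal{A}_n)$. A random time $T$ is called a
stopping time with respect to a filtration $\mathcal{A}_n$ if
$\{T \leq n\} \in \mathcal{A}_n$ for all $n \in \mathbb{N}$.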


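Remark. A random time $T$ is a stopping time if and only if
$\{T = n\} \in \mathcal{A}_n$ for all $n \in \mathbb{N}$, since
$\{T \leq n\} = \bigcup_{0 \leq k \leq n} \{T = k\} \in \mathcal{A}_n$ and conversely
$\{T = n\} = \{T \leq n\} \setminus \{T \leq n - 1\}$.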


Remark. Here is an interpretation: stopping times are random times, whose 
occurrence can be determined without pre-knowledge of the future. The 
term comes from gambling. A gambler is forced to stop playing if his capital
is zero. Whether or not you stop after the $n$-th game depends only on the
history up to and including the time $n$.


Example. First entry time.
Let $X_n$ be an $\mathcal{A}_n$-adapted process and let $B \in \mathcal{B}$ be a Borel set
in $\mathbb{R}^d$. Define


$T(\omega) = \inf\{n \geq 0 \mid X_n(\omega) \in B\}$

which is the time of first entry of $X_n$ into $B$. The set $\{T = \infty\}$ is the
set of $\omega$ for which $X_n(\omega)$ never enters $B$. Obviously

$\{T \leq n\} = \bigcup_{k=0}^n \{X_k \in B\} \in \mathcal{A}_n$


so that $T$ is a stopping time.


Example. "Continuous Black-Jack": let $X_i$ be IID random variables with
uniform distribution in $[0,1]$. Define $S_n = \sum_{k=1}^n X_k$ and let $T(\omega)$ be
the smallest integer so that $S_n(\omega) > 1$. This is a stopping time. A popular
problem asks for the expectation of this random variable $T$: how many
"cards" $X_i$ do we have to draw until we get busted and the sum is larger
than 1? We obviously have $P[T = 1] = 0$. Now, $P[T = 2] = P[X_2 > 1 - X_1]$
is the area of the region $\{(x,y) \in [0,1] \times [0,1] \mid y > 1 - x\}$, which is
1/2. More generally, $P[T > k] = P[S_k \leq 1]$ is the volume of the simplex
$\{x \in [0,1]^k \mid x_1 + \cdots + x_k \leq 1\}$, which is $1/k!$. Therefore
$P[T = k] = 1/(k-1)! - 1/k! = (k-1)/k!$ and the expectation of $T$ is
$E[T] = \sum_{k=0}^\infty P[T > k] = \sum_{k=0}^\infty 1/k! = e$. This means that if we play
Black-Jack with uniformly distributed random variables and threshold 1,
we expect to get busted in more than 2, but less than 3 "cards".
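

Remark. The value $E[T] = e$ is easy to test by simulation. The following
Python sketch is a minimal Monte Carlo estimate. The sample size, the seed
and the function names are assumptions chosen for illustration, and the
result is only approximate.

```python
import math
import random

def draws_until_bust(rng):
    """Number of uniform [0,1) 'cards' drawn until the running sum exceeds 1."""
    total, count = 0.0, 0
    while total <= 1.0:
        total += rng.random()
        count += 1
    return count

def estimate_expected_draws(samples, seed=0):
    """Monte Carlo estimate of E[T] with a fixed seed for reproducibility."""
    rng = random.Random(seed)
    return sum(draws_until_bust(rng) for _ in range(samples)) / samples

estimate = estimate_expected_draws(200_000)
print(estimate, math.e)
```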


Example. Last exit time.
Assume the same setup as in the first entry time example. But this time


$T(\omega) = \sup\{n \geq 0 \mid X_n(\omega) \in B\}$


is not a stopping time since it is impossible to know that $X$ will return to
$B$ after some time $k$ without knowing the whole future.


Proposition 3.2.4. Let $T_1, T_2$ be two stopping times. The infimum
$T_1 \wedge T_2$, the maximum $T_1 \vee T_2$ as well as the sum $T_1 + T_2$ are stopping
times.


Proof. This is obvious from the definition because $\mathcal{A}_n$-measurable
functions are closed under taking minima, maxima and sums. □


Definition. Given a stochastic process $X_n$ which is adapted to a filtration
$\mathcal{A}_n$ and let $T$ be a stopping time with respect to $\mathcal{A}_n$, define the
random variable


$X_T(\omega) = X_{T(\omega)}(\omega)$ if $T(\omega) < \infty$, and $X_T(\omega) = 0$ otherwise,


or equivalently $X_T = \sum_{n=0}^\infty X_n 1_{\{T = n\}}$. The process
$X^T_n = X_{T \wedge n}$ is called the stopped process. It is equal to $X_T$ for
times $T \leq n$ and equal to $X_n$ if $T \geq n$.


Proposition 3.2.5. If $X$ is a supermartingale and $T$ is a stopping time, then
the stopped process $X^T$ is a supermartingale. In particular
$E[X^T_n] \leq E[X_0]$. The same statement is true if supermartingale is
replaced by martingale, in which case $E[X^T_n] = E[X_0]$.


Proof. Define the "stake process" $C^{(T)}$ by $C^{(T)}_n = 1_{\{n \leq T\}}$. You can
think of it as betting 1 unit and quitting immediately after time $T$.
Define then the "winning process"


$(\int C^{(T)} \, dX)_n = \sum_{k=1}^n C^{(T)}_k (X_k - X_{k-1}) = X_{T \wedge n} - X_0$ ,

or shortly $\int C^{(T)} \, dX = X^T - X_0$. The process $C^{(T)}$ is previsible,
since it can only take values 0 and 1 and
$\{C^{(T)}_n = 0\} = \{T \leq n - 1\} \in \mathcal{A}_{n-1}$. The claim follows from the
"system can't be beaten" theorem. □


Remark. It is important that we take the stopped process $X^T$ and not the
random variable $X_T$:

for the random walk $X$ on $\mathbb{Z}$ starting at 0, let $T$ be the stopping time
$T = \inf\{n \mid X_n = 1\}$. This is the martingale strategy in the casino which
gave the name to these processes. As we will see later on, the random walk
is recurrent in one dimension, so that $P[T < \infty] = 1$. However


$1 = E[X_T] \neq E[X_0] = 0$ .

The above theorem only gives $E[X^T_n] = E[X_0]$ for every $n$.


When can we say $E[X_T] = E[X_0]$? The answer is given by Doob's optimal
stopping time theorem:


Theorem 3.2.6 (Doob's optimal stopping time theorem). Let $X$ be a
supermartingale and $T$ be a stopping time. If one of the five following
conditions is true

(i) $T$ is bounded.

(ii) $X$ is bounded and $T$ is almost everywhere finite.

(iii) $T \in \mathcal{L}^1$ and $|X_n - X_{n-1}|$ is bounded.

(iv) $X_T \in \mathcal{L}^1$ and $\lim_{k \to \infty} E[X_k; \{T > k\}] = 0$.

(v) $X$ is uniformly integrable and $T$ is almost everywhere finite.

then $E[X_T] \leq E[X_0]$.
If $X$ is a martingale and any of the five conditions is true, then
$E[X_T] = E[X_0]$.


Proof. We know that $E[X_{T \wedge n} - X_0] \leq 0$ because $X$ is a supermartingale.
(i) Because $T$ is bounded, we can take $n = \sup T(\omega) < \infty$ and get


$E[X_T - X_0] = E[X_{T \wedge n} - X_0] \leq 0$ .

(ii) Use the dominated convergence theorem to get


$E[X_T - X_0] = \lim_{n \to \infty} E[X_{T \wedge n} - X_0] \leq 0$ .

(iii) We have a bound $|X_n - X_{n-1}| \leq K$ and so


$|X_{T \wedge n} - X_0| = |\sum_{k=1}^{T \wedge n} (X_k - X_{k-1})| \leq K T$ .


Because $KT \in \mathcal{L}^1$, the result follows from the dominated convergence
theorem.

(iv) By (i), we get
$E[X_0] \geq E[X_{T \wedge k}] = E[X_T; \{T \leq k\}] + E[X_k; \{T > k\}]$
and taking the limit gives
$E[X_0] \geq \lim_{k \to \infty} E[X_T; \{T \leq k\}] = E[X_T]$ by the dominated
convergence theorem and the assumption.

(v) The uniform integrability $\sup_n E[|X_n|; |X_n| > R] \to 0$ for $R \to \infty$
assures that $X_T \in \mathcal{L}^1$ since
$E[|X_T|] \leq k \cdot \max_{1 \leq i \leq k} E[|X_i|] + \sup_n E[|X_n|; \{T > k\}] < \infty$.
Since $|E[X_k; \{T > k\}]| \leq \sup_n E[|X_n|; \{T > k\}] \to 0$, we can apply (iv).


If $X$ is a martingale, we use the supermartingale case for both $X$ and
$-X$. □


Remark. The interpretation of this result is that a fair game cannot be 
made unfair by sampling it with bounded stopping times. 


Theorem 3.2.7 (No winning strategy). Assume $X$ is a martingale and suppose
$|X_n - X_{n-1}|$ is bounded. Given a previsible process $C$ which is bounded
and let $T \in \mathcal{L}^1$ be a stopping time, then $E[(\int C \, dX)_T] = 0$.


Proof. We know that $\int C \, dX$ is a martingale and since $(\int C \, dX)_0 = 0$,
the claim follows from the optimal stopping time theorem part (iii). □


Remark. The martingale strategy mentioned in the introduction shows
that for unbounded stopping times, there is a winning strategy. With the
martingale strategy one has $T = n$ with probability $1/2^n$. The player always
wins, she just has to double the bet until the coin changes sign. But it
assumes an "infinitely thick wallet". With a finite but large initial capital,
there is a very small risk to lose, but then the loss is large. You see that in
the real world: players with large capital in the stock market mostly win,
but if they lose, their loss can be huge.


Martingales can be characterized using stopping times:


Theorem 3.2.8 (Komatsu's lemma). Let $X$ be an $\mathcal{A}_n$-adapted sequence of
random variables in $\mathcal{L}^1$ such that for every bounded stopping time $T$


$E[X_T] = E[X_0]$ ,


then $X$ is a martingale with respect to $\mathcal{A}_n$.


Proof. Fix $n \in \mathbb{N}$ and $A \in \mathcal{A}_n$. The map


$T = n + 1 - 1_A$, which is $n$ for $\omega \in A$ and $n + 1$ for $\omega \notin A$,

is a stopping time because $\sigma(T) = \{\emptyset, A, A^c, \Omega\} \subset \mathcal{A}_n$. Apply
$E[X_T] = E[X_0]$ and $E[X_{T'}] = E[X_0]$ for the bounded constant stopping
time $T' = n + 1$ to get

$E[X_n; A] + E[X_{n+1}; A^c] = E[X_T] = E[X_0] = E[X_{T'}] = E[X_{n+1}]$
$= E[X_{n+1}; A] + E[X_{n+1}; A^c]$


so that $E[X_{n+1}; A] = E[X_n; A]$. Since this is true for any $A \in \mathcal{A}_n$, we
know that $E[X_{n+1}|\mathcal{A}_n] = E[X_n|\mathcal{A}_n] = X_n$ and $X$ is a martingale. □


Example. The gambler's ruin problem is the following question: Let $Y_i$ be
IID with $P[Y_i = \pm 1] = 1/2$ and let $X_n = \sum_{k=1}^n Y_k$ be the random walk
with $X_0 = 0$. We know that $X$ is a martingale with respect to $Y$. Given
$a, b > 0$, we define the stopping time


$T = \min\{n \geq 0 \mid X_n = b \text{ or } X_n = -a\}$ .


We want to compute $P[X_T = -a]$ and $P[X_T = b]$ in dependence of $a, b$.


Figure. Three samples of a process $X_n$ starting at $X_0 = 0$. The process is
stopped with the stopping time $T$, when $X_n$ hits the lower bound $-a$ or the
upper bound $b$. If $X_n$ is the winning of a first gambler, which is the loss
of a second gambler, then $T$ is the time at which one of the gamblers is
broke. The initial capital of the first gambler is $a$, the initial capital
of the second gambler is $b$.


Remark. Let $Y_i$ be the outcomes of a series of fair gambles between two
players A and B and let the random variables $X_n$ be the net change in the
fortune of A after $n$ independent games. If at the beginning, A
has fortune $a$ and B has fortune $b$, then $P[X_T = -a]$ is the ruin probability
of A and $P[X_T = b]$ is the ruin probability of B.


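Proposition 3.2.9.


$P[X_T = -a] = 1 - P[X_T = b] = \frac{b}{a + b}$ .


Proof. $T$ is finite almost everywhere. One can see this by the law of the
iterated logarithm,


$\limsup_n \frac{X_n}{\sqrt{2 n \log\log n}} = 1$, $\liminf_n \frac{X_n}{\sqrt{2 n \log\log n}} = -1$ .

(We will give later a direct proof of the finiteness of $T$, when we treat the
random walk in more detail.) It follows that $P[X_T = -a] = 1 - P[X_T = b]$.
We check that $X_k$ satisfies condition (iv) in Doob's stopping time theorem:
since $X_T$ takes values in $\{-a, b\}$, it is in $\mathcal{L}^1$ and because on the set
$\{T > k\}$, the value of $X_k$ is in $(-a, b)$, we have
$|E[X_k; \{T > k\}]| \leq \max\{a, b\} P[T > k] \to 0$. Doob's theorem therefore
gives $0 = E[X_0] = E[X_T] = -a P[X_T = -a] + b P[X_T = b]$, and together
with $P[X_T = -a] + P[X_T = b] = 1$ this yields $P[X_T = -a] = b/(a + b)$. □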
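

Remark. The ruin probability $b/(a + b)$ can be compared with a simulation.
The following Python sketch runs fair $\pm 1$ walks from 0 until they hit $-a$
or $b$. The values $a = 3$, $b = 7$, the number of runs and the seed are
hypothetical choices, and the estimate is only approximate.

```python
import random

def ruin_frequency(a, b, runs, seed=0):
    """Fraction of fair +-1 walks started at 0 that hit -a before b."""
    rng = random.Random(seed)
    ruined = 0
    for _ in range(runs):
        x = 0
        while -a < x < b:                        # stop at time T
            x += 1 if rng.random() < 0.5 else -1
        ruined += (x == -a)
    return ruined / runs

a, b = 3, 7
estimate = ruin_frequency(a, b, runs=20_000)
exact = b / (a + b)                              # formula from the proposition
print(estimate, exact)
```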


Remark. Some boundedness condition is necessary in Doob's stopping time
theorem. Let $T = \inf\{n \mid X_n = 1\}$. Then $E[X_T] = 1$ but $E[X_0] = 0$, which
shows that some condition on $T$ or $X$ has to be imposed. This fact leads
to the "martingale" gambling strategy defined by doubling the bet when
losing. If the casinos did not impose a bound on the possible stakes,
this gambling strategy would lead to wins. But you have to go there with
enough money. One can see it also like this: if you are A and the casino is
B and $b = 1$, $a = \infty$, then $P[X_T = b] = 1$, which means that the casino is
ruined with probability 1.


Theorem 3.2.10 (Wald's identity). Assume $T$ is a stopping time of an
$\mathcal{L}^1$-process $Y$ for which $Y_i$ are IID random variables with expectation
$E[Y_i] = m$ and $T \in \mathcal{L}^1$. The process $S_n = \sum_{k=1}^n Y_k$ satisfies


$E[S_T] = m E[T]$ .


Proof. The process $X_n = S_n - n E[Y_1]$ is a martingale satisfying condition
(iii) in Doob's stopping time theorem. Therefore


$0 = E[X_0] = E[X_T] = E[S_T - T E[Y_1]]$ .

Now solve for $E[S_T]$. □
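

Remark. Wald's identity can be illustrated numerically. In the following
Python sketch, the $Y_i$ are fair die rolls ($m = 3.5$) and $T$ is the first time
the running sum reaches a threshold, which is a bounded stopping time. The
threshold 20, the number of runs and the seed are assumptions chosen for
illustration.

```python
import random

def wald_check(threshold, runs, seed=0):
    """Return Monte Carlo estimates of (E[S_T], E[T]) for die rolls."""
    rng = random.Random(seed)
    sum_S, sum_T = 0, 0
    for _ in range(runs):
        s, t = 0, 0
        while s < threshold:          # T = first n with S_n >= threshold
            s += rng.randint(1, 6)    # fair die roll, mean 3.5
            t += 1
        sum_S += s
        sum_T += t
    return sum_S / runs, sum_T / runs

mean_S_T, mean_T = wald_check(threshold=20, runs=50_000)
m = 3.5
print(mean_S_T, m * mean_T)
```

The two printed numbers should agree up to sampling error.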


3.3 Doob’s convergence theorem 


Definition. Given a stochastic process $X$ and two real numbers $a < b$, we
define the random variable


$U_n[a,b](\omega) = \max\{k \in \mathbb{N} \mid \exists \, 0 \leq s_1 < t_1 < \cdots < s_k < t_k \leq n,$
$X_{s_i}(\omega) < a, X_{t_i}(\omega) > b, 1 \leq i \leq k\}$


called the number of up-crossings of $[a,b]$. Denote with $U_\infty[a,b]$ the limit

$U_\infty[a,b] = \lim_{n \to \infty} U_n[a,b]$ .


Because $n \mapsto U_n[a,b]$ is monotone, this limit exists in $\mathbb{N} \cup \{\infty\}$.


Figure. A random walk crossing two values $a < b$. An up-crossing starts at
a time $s$ where $X_s < a$ and lasts until the first later time $t$ at which
$X_t > b$. The random variable $U_n[a,b]$ with values in $\mathbb{N}$ measures the
number of up-crossings in the time interval $[0, n]$.
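

Remark. Counting up-crossings of a given path is a simple greedy scan. The
following Python sketch assumes the strict convention $X_s < a$, $X_t > b$. The
function name and the sample path are hypothetical.

```python
def count_upcrossings(path, a, b):
    """U_n[a,b] for path = [X_0, ..., X_n], with strict X_s < a and X_t > b."""
    count = 0
    below = False              # True once the path has dipped below a
    for x in path:
        if not below and x < a:
            below = True       # an up-crossing attempt starts here
        elif below and x > b:
            count += 1         # the attempt is completed above b
            below = False
    return count

path = [0, -2, 1, 3, 0, -1, -3, 2, 4, 1, -2]
upcrossings_total = count_upcrossings(path, a=-1, b=2)
prefix_counts = [count_upcrossings(path[:n + 1], -1, 2) for n in range(len(path))]
print(upcrossings_total, prefix_counts)
```

The list of prefix counts is nondecreasing in $n$, which is the monotonicity
used to define $U_\infty[a,b]$.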


Theorem 3.3.1 (Doob's up-crossing inequality). If $X$ is a supermartingale,
then


$(b - a) E[U_n[a,b]] \leq E[(X_n - a)^-]$ .


Proof. Define $C_1 = 1_{\{X_0 < a\}}$ and inductively for $n \geq 2$ the process


$C_n = 1_{\{C_{n-1} = 1\}} 1_{\{X_{n-1} \leq b\}} + 1_{\{C_{n-1} = 0\}} 1_{\{X_{n-1} < a\}}$ .

It is previsible. Define the winning process $Y = \int C \, dX$ which satisfies by
definition $Y_0 = 0$. We have the winning inequality


$Y_n(\omega) \geq (b - a) U_n[a,b](\omega) - (X_n(\omega) - a)^-$ .


Every up-crossing of $[a,b]$ increases the $Y$-value (the winning) by at least
$(b - a)$, while $(X_n - a)^-$ is essentially the loss during the last interval of
play.

Since $C$ is previsible, bounded and nonnegative, we know that $Y_n$ is also a
supermartingale (see "the system can't be beaten") and we have therefore
$E[Y_n] \leq 0$. Taking expectation of the winning inequality gives the claim.
□


Remark. The proof uses the following strategy for putting your stakes C: 
wait until X gets below a. Play then unit stakes until X gets above b and 
stop playing. Wait again until X gets below a, etc. 
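

Remark. The winning inequality $Y_n \geq (b - a) U_n[a,b] - (X_n - a)^-$ holds
path by path, so it can be checked exhaustively on short paths. The
following Python sketch implements the stake rule (start playing once
$X$ is below $a$, stop once $X$ is above $b$) and tests the inequality on all
$\pm 1$ paths of a given length. The parameters $n = 12$, $a = -1$, $b = 1$ and the
function names are illustrative assumptions.

```python
from itertools import product

def upcrossings(path, a, b):
    """Number of completed up-crossings of [a,b] (strict X_s < a, X_t > b)."""
    count, below = 0, False
    for x in path:
        if not below and x < a:
            below = True
        elif below and x > b:
            count, below = count + 1, False
    return count

def winnings(path, a, b):
    """Y_n = sum_k C_k (X_k - X_{k-1}) for the buy-below-a, sell-above-b stakes."""
    total, stake = 0, 0
    for k in range(1, len(path)):
        prev = path[k - 1]
        if stake == 1:
            stake = 1 if prev <= b else 0     # keep playing until X exceeds b
        else:
            stake = 1 if prev < a else 0      # start playing once X is below a
        total += stake * (path[k] - path[k - 1])
    return total

def winning_inequality_holds(n, a, b):
    """Check Y_n >= (b-a) U_n[a,b] - (X_n - a)^- on all 2^n paths from 0."""
    for steps in product((-1, 1), repeat=n):
        path = [0]
        for s in steps:
            path.append(path[-1] + s)
        negative_part = max(a - path[-1], 0)  # (X_n - a)^-
        if winnings(path, a, b) < (b - a) * upcrossings(path, a, b) - negative_part:
            return False
    return True

print(winning_inequality_holds(12, -1, 1))
```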


Definition. We say a stochastic process $X_n$ is bounded in $\mathcal{L}^p$ if there
exists $M \in \mathbb{R}$ such that $||X_n||_p \leq M$ for all $n \in \mathbb{N}$.


Corollary 3.3.2. If $X$ is a supermartingale which is bounded in $\mathcal{L}^1$, then


$P[U_\infty[a,b] = \infty] = 0$ .


Proof. By the up-crossing lemma, we have for each $n \in \mathbb{N}$


$(b - a) E[U_n[a,b]] \leq |a| + E[|X_n|] \leq |a| + \sup_n ||X_n||_1 < \infty$ .


By the monotone convergence theorem

$(b - a) E[U_\infty[a,b]] < \infty$ ,

which gives the claim. □


Remark. If $S_n = \sum_{k=1}^n X_k$ is the one dimensional random walk, then it is
a martingale which is unbounded in $\mathcal{L}^1$. In this case,
$E[U_\infty[a,b]] = \infty$.


Theorem 3.3.3 (Doob's convergence theorem). Let $X_n$ be a supermartingale
which is bounded in $\mathcal{L}^1$. Then


$X_\infty = \lim_{n \to \infty} X_n$

exists almost everywhere.


Proof. Define


$A = \{\omega \in \Omega \mid X_n \text{ has no limit in } [-\infty, \infty]\}$
$= \{\omega \in \Omega \mid \liminf X_n < \limsup X_n\}$
$= \bigcup_{a < b, \, a, b \in \mathbb{Q}} \{\omega \in \Omega \mid \liminf X_n < a < b < \limsup X_n\}$
$= \bigcup_{a < b, \, a, b \in \mathbb{Q}} A_{a,b}$ .


Since $A_{a,b} \subset \{U_\infty[a,b] = \infty\}$ we have $P[A_{a,b}] = 0$ and therefore also
$P[A] = 0$. Therefore $X_\infty = \lim_{n \to \infty} X_n$ exists almost surely. By Fatou's
lemma


$E[|X_\infty|] = E[\liminf |X_n|] \leq \liminf E[|X_n|] \leq \sup_n E[|X_n|] < \infty$


so that $P[|X_\infty| < \infty] = 1$. □


Example. Let $X$ be an integrable random variable on $([0,1), \mathcal{A}, P)$, where
$P$ is the Lebesgue measure. The finite $\sigma$-algebra $\mathcal{A}_n$ generated by the
intervals


$[k/2^n, (k+1)/2^n)$, $k = 0, \dots, 2^n - 1$,


defines a filtration and $X_n = E[X|\mathcal{A}_n]$ is a martingale which converges. We
will see below with Lévy's upward theorem (3.4.2) that the limit actually is
the random variable $X$.
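

Remark. On a discretized unit interval, the conditional expectation with
respect to a dyadic $\sigma$-algebra is just a block average. The following Python
sketch, in which the grid depth 10, the sample function $X(x) = x^2$ and the
function name are assumptions, computes $X_n = E[X|\mathcal{A}_n]$ for dyadic
partitions, checks the tower property and measures the $\mathcal{L}^1$ distance to $X$.

```python
def dyadic_conditional_expectation(values, level):
    """E[X | A_level] for X constant on 2^depth equal cells: block averages."""
    block = len(values) >> level                 # cells per dyadic interval
    out = []
    for start in range(0, len(values), block):
        mean = sum(values[start:start + block]) / block
        out.extend([mean] * block)
    return out

depth = 10
cells = 2 ** depth
X = [((i + 0.5) / cells) ** 2 for i in range(cells)]     # X(x) = x^2 on cell midpoints

approximations = [dyadic_conditional_expectation(X, n) for n in range(depth + 1)]
errors = [sum(abs(u - v) for u, v in zip(Xn, X)) / cells for Xn in approximations]

# Tower property: averaging X_5 over the coarser level-3 blocks gives X_3.
coarse = dyadic_conditional_expectation(approximations[5], 3)
tower_gap = max(abs(u - v) for u, v in zip(coarse, approximations[3]))
print(errors[0], errors[5], errors[depth], tower_gap)
```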


Example. Let $X_k$ be IID random variables in $\mathcal{L}^1$ with mean zero. For
$0 < \lambda < 1$, the branching random walk $S_n = \sum_{k=0}^n \lambda^k X_k$ is a
martingale which is bounded in $\mathcal{L}^1$ because


$||S_n||_1 \leq \frac{1}{1 - \lambda} ||X_0||_1$ .


The martingale converges by Doob's convergence theorem almost surely.
One can also deduce this from Kolmogorov's theorem (2.11.3) if
$X_k \in \mathcal{L}^2$. Doob's convergence theorem (3.3.3) assures convergence for
$X_k \in \mathcal{L}^1$.


Remark. Of course, we can replace supermartingale by submartingale or 
martingale in the theorem. 


Example. We look again at Polya's urn scheme, which was defined earlier.
Since the process $X$ giving the fraction of black balls is a martingale and
bounded, $0 \leq X \leq 1$, we can apply the convergence theorem: there exists
$X_\infty$ with $X_n \to X_\infty$.


Corollary 3.3.4. If $X$ is a non-negative supermartingale, then
$X_\infty = \lim_{n \to \infty} X_n$ exists almost everywhere and is finite.


Proof. Since the supermartingale property gives $E[|X_n|] = E[X_n] \leq E[X_0]$,
the process $X_n$ is bounded in $\mathcal{L}^1$. Apply Doob's convergence theorem. □


Remark. This corollary is also true for non-positive submartingales or mar- 
tingales, which are either nonnegative or non-positive. 


Example. For the branching process, we had IID random variables $Z_{ni}$
with positive finite mean $m$ and defined $Y_0 = 1$, $Y_{n+1} = \sum_{k=1}^{Y_n} Z_{nk}$. We
saw that the process $X_n = Y_n/m^n$ is non-negative and a martingale.
According to the above corollary, the limit $X_\infty$ exists almost everywhere. It
is an interesting problem to find the distribution of $X_\infty$: assume $Z_{ni}$ have
the generating function $f(\theta) = E[\theta^{Z_{ni}}]$.


(i) $Y_n$ has the generating function $f^n(\theta) = f(f^{n-1}(\theta))$.
We prove this by induction. For $n = 1$ this is trivial. Using the independence
of $Z_{nk}$ we have

$E[\theta^{Y_{n+1}} | Y_n = k] = f(\theta)^k$


and so

$E[\theta^{Y_{n+1}} | Y_n] = f(\theta)^{Y_n}$ .


By the tower property, this leads to

$E[\theta^{Y_{n+1}}] = E[f(\theta)^{Y_n}]$ .

Write $\alpha = f(\theta)$ and use induction to simplify the right hand side to

$E[f(\theta)^{Y_n}] = E[\alpha^{Y_n}] = f^n(\alpha) = f^n(f(\theta)) = f^{n+1}(\theta)$ .


(ii) In order to find the distribution of $X_\infty$ we calculate instead the
characteristic function


$L(\lambda) = L(X_\infty)(\lambda) = E[\exp(i \lambda X_\infty)]$ .


Since $X_n \to X_\infty$ almost everywhere, we have $L(X_n)(\lambda) \to L(X_\infty)(\lambda)$.
Since $X_n = Y_n/m^n$ and $E[\theta^{Y_n}] = f^n(\theta)$, we have


$L(X_n)(\lambda) = f^n(e^{i \lambda/m^n})$

so that $L$ satisfies the functional equation

$L(\lambda m) = f(L(\lambda))$ .


Theorem 3.3.5 (Limit distribution of the branching process). For the
branching process defined by IID random variables $Z_{ni}$ having the
generating function $f$, the Fourier transform $L(\lambda) = E[e^{i \lambda X_\infty}]$ of the
distribution of the limit martingale $X_\infty$ can be computed by solving the
functional equation


$L(\lambda \cdot m) = f(L(\lambda))$ .


Remark. If $f$ has no analytic extension to the complex plane, we have to
replace the Fourier transform with the Laplace transform


$L(\lambda) = E[e^{-\lambda X_\infty}]$ .


Remark. Related to Doob's convergence theorem for supermartingales is
Kingman's subadditive ergodic theorem, which generalizes Birkhoff's
ergodic theorem and which we state without proof. Neither of the two
theorems is however a corollary of the other.


Definition. A sequence of random variables $X_n$ is called subadditive with
respect to a measure preserving transformation $T$, if
$X_{m+n} \leq X_m + X_n(T^m)$ almost everywhere.


Theorem 3.3.6 (The subadditive ergodic theorem of Kingman). Given a
sequence of random variables $X_n : X \to \mathbb{R} \cup \{-\infty\}$ with
$X_n^+ := \max(0, X_n) \in \mathcal{L}^1(X)$ and which is subadditive with respect to a
measure preserving transformation $T$. Then there exists a $T$-invariant
integrable measurable function $X : \Omega \to \mathbb{R} \cup \{-\infty\}$ such that
$\frac{1}{n} X_n(x) \to X(x)$ for almost all $x \in X$. Furthermore
$\frac{1}{n} E[X_n] \to E[X]$.


If the condition of boundedness of the process in Doob's convergence
theorem is strengthened a bit by assuming that $X_n$ is uniformly integrable,
then one can reverse in some sense the convergence theorem:


Theorem 3.3.7 (Doob's convergence theorem for uniformly integrable
supermartingales). A supermartingale $X_n$ is uniformly integrable if and only
if there exists $X$ such that $X_n \to X$ in $\mathcal{L}^1$.


Proof. If $X_n$ is uniformly integrable, then $X_n$ is bounded in $\mathcal{L}^1$ and
Doob's convergence theorem gives $X_n \to X$ almost everywhere. But a uniformly
integrable family $X_n$ which converges almost everywhere converges in
$\mathcal{L}^1$. On the other hand, a sequence $X_n \in \mathcal{L}^1$ converging to
$X \in \mathcal{L}^1$ is uniformly integrable. □


Theorem 3.3.8 (Characterization of uniformly integrable martingales). An
$\mathcal{A}_n$-adapted process is a uniformly integrable martingale if and only if
$X_n \to X$ in $\mathcal{L}^1$ and $X_n = E[X|\mathcal{A}_n]$.


Proof. By Doob's convergence theorem for uniformly integrable
supermartingales (3.3.7), we know the "only if" part. We already know that
$X_n = E[X|\mathcal{A}_n]$ is a martingale. What we have to show is that it is
uniformly integrable.
Given $\epsilon > 0$. Choose $\delta > 0$ such that for all $A \in \mathcal{A}$, the condition
$P[A] < \delta$ implies $E[|X|; A] < \epsilon$. Choose further $K \in \mathbb{R}$ such that
$K^{-1} \cdot E[|X|] < \delta$. By Jensen's inequality


$|X_n| = |E[X|\mathcal{A}_n]| \leq E[|X| \,|\, \mathcal{A}_n]$ and hence $E[|X_n|] \leq E[|X|]$ .


Therefore

$K \cdot P[|X_n| > K] \leq E[|X_n|] \leq E[|X|] < \delta \cdot K$


so that $P[|X_n| > K] < \delta$. By definition of conditional expectation,
$|X_n| \leq E[|X| \,|\, \mathcal{A}_n]$ and $\{|X_n| > K\} \in \mathcal{A}_n$, so that


$E[|X_n|; |X_n| > K] \leq E[|X|; |X_n| > K] < \epsilon$ .
□


Remark. As a summary we can say that a supermartingale $X_n$ which is either
bounded in $\mathcal{L}^1$ or nonnegative or uniformly integrable converges almost
everywhere.


Exercise. Let $S$ and $T$ be stopping times satisfying $S \leq T$.
a) Show that the process


$C_n(\omega) = 1_{\{S(\omega) < n \leq T(\omega)\}}$


is previsible.
b) Show that for every supermartingale $X$ and stopping times $S \leq T$ the
inequality

$E[X_T] \leq E[X_S]$


holds.


Exercise. In Polya's urn process, let $Y_n$ be the number of black balls after
$n$ steps. Let $X_n = Y_n/(n + 2)$ be the fraction of black balls. We have seen
that $X$ is a martingale.

a) Prove that $P[Y_n = k] = 1/(n + 1)$ for every $1 \leq k \leq n + 1$.

b) Compute the distribution of the limit $X_\infty$.


Exercise. a) Which polynomials $f$ can you realize as generating functions
of a probability distribution? Denote this class of polynomials with
$\mathcal{P}$.

b) Design a martingale $X_n$, where the iteration of polynomials
$P \in \mathcal{P}$ plays a role.

c) Use one of the consequences of Doob's convergence theorem to show
that the dynamics of every polynomial $P \in \mathcal{P}$ on the positive axis can be
conjugated to a linear map $T : z \mapsto mz$: there exists a map $L$ such that


$L \circ T(z) = P \circ L(z)$


for every $z \in \mathbb{R}^+$.


Example. The branching process $Y_{n+1} = \sum_{k=1}^{Y_n} Z_{nk}$ defined by random
variables $Z_{nk}$ having generating function $f$ and mean $m$ defines a
martingale $X_n = Y_n/m^n$. We have seen that the Laplace transform
$L(\lambda) = E[e^{-\lambda X_\infty}]$ of the limit $X_\infty$ satisfies the functional equation


$L(m \lambda) = f(L(\lambda))$ .


We assume that the IID random variables $Z_{nk}$ have the geometric
distribution $P[Z = k] = p(1 - p)^k = p q^k$ with parameter $0 < p < 1$. The
probability generating function of this distribution is


$f(\theta) = E[\theta^Z] = \sum_{k=0}^\infty \theta^k p q^k = \frac{p}{1 - q\theta}$ .


As we have seen in proposition (2.12.5),

$E[Z] = \sum_{k=1}^\infty p q^k k = \frac{q}{p}$ .


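The function $f^n(\theta)$ can be computed as


$f^n(\theta) = \frac{p m^n (1 - \theta) + q\theta - p}{q m^n (1 - \theta) + q\theta - p}$ with $m = q/p$ .

This is because $f$ is a Möbius transformation and iterating $f$ corresponds
to looking at the power
$A^n = \begin{pmatrix} 0 & p \\ -q & 1 \end{pmatrix}^n$. This power can be
computed by diagonalizing $A$:


$A^n = (q - p)^{-1} \begin{pmatrix} 1 & p \\ 1 & q \end{pmatrix} \begin{pmatrix} p^n & 0 \\ 0 & q^n \end{pmatrix} \begin{pmatrix} q & -p \\ -1 & 1 \end{pmatrix}$ .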
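
Remark. A quick exact check confirms that iterating $f(\theta) = p/(1 - q\theta)$
agrees with the closed form
$(p m^n (1 - \theta) + q\theta - p)/(q m^n (1 - \theta) + q\theta - p)$, where $m = q/p$. The
following Python sketch uses rational arithmetic. The parameter $p = 2/5$,
the test point $\theta = 1/3$ and the function names are hypothetical choices,
and the closed form requires $p \neq q$.

```python
from fractions import Fraction

def f(theta, p):
    """Generating function of the geometric law P[Z = k] = p q^k."""
    return p / (1 - (1 - p) * theta)

def iterate(theta, p, n):
    """n-fold composition f^n(theta)."""
    for _ in range(n):
        theta = f(theta, p)
    return theta

def closed_form(theta, p, n):
    """Closed form of f^n(theta) obtained from the Moebius matrix power."""
    q = 1 - p
    mn = (q / p) ** n
    numerator = p * mn * (1 - theta) + q * theta - p
    denominator = q * mn * (1 - theta) + q * theta - p
    return numerator / denominator

p = Fraction(2, 5)
theta = Fraction(1, 3)
agree = all(iterate(theta, p, n) == closed_form(theta, p, n) for n in range(0, 8))
print(agree)
```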
We get therefore


$L(\lambda) = E[e^{-\lambda X_\infty}] = \lim_{n \to \infty} E[e^{-\lambda Y_n/m^n}] = \lim_{n \to \infty} f^n(e^{-\lambda/m^n}) = \frac{p\lambda + q - p}{q\lambda + q - p}$ .

If $m \leq 1$, then the law of $X_\infty$ is a Dirac mass at 0. This means that the
process dies out. We see in this case directly that
$\lim_{n \to \infty} f^n(\theta) = 1$. In the case $m > 1$, the law of $X_\infty$ has a point mass
at 0 of weight $p/q = 1/m$ and an absolutely continuous part
$(1/m - 1)^2 e^{(1/m - 1)x} dx$. This can be seen by performing a "look up" in a
table of Laplace transforms


$L(\lambda) = \frac{p}{q} e^{-\lambda \cdot 0} + \int_0^\infty (1 - p/q)^2 e^{(p/q - 1)x} \cdot e^{-\lambda x} \, dx$ .


Definition. Define $p_n = P[Y_n = 0]$, the probability that the process dies
out until time $n$. Since $p_n = f^n(0)$ we have $p_{n+1} = f(p_n)$. If $f(p) = p$, $p$ is
called the extinction probability.


Proposition 3.3.9. For a branching process with $E[Z] > 1$, the extinction
probability is the unique solution of $f(x) = x$ in $(0,1)$. For $E[Z] \leq 1$, the
extinction probability is 1.


Proof. The generating function
$f(\theta) = E[\theta^Z] = \sum_{n=0}^\infty P[Z = n] \theta^n = \sum_n p_n \theta^n$ is analytic in
$[0,1]$. It is nondecreasing and satisfies $f(1) = 1$.
If we assume that $P[Z = 0] > 0$, then $f(0) > 0$ and there exists a unique
solution of $f(x) = x$ satisfying $f'(x) < 1$. The orbit $f^n(u)$ converges to
this fixed point for every $u \in (0,1)$ and this fixed point is the extinction
probability of the process. The value of $f'(1) = E[Z]$ decides whether there
exists an attracting fixed point in the interval $(0,1)$ or not. □
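

Remark. The recursion $p_{n+1} = f(p_n)$ with $p_0 = 0$ gives a direct way to
compute the extinction probability numerically. The following Python
sketch iterates it for three offspring laws: a geometric law with
$p = 0.4$ (where the fixed point is $p/q$), a Poisson law with mean 1.1 and a
Poisson law with mean 0.9. The tolerance, the parameter values and the
function names are assumptions for illustration.

```python
import math

def extinction_probability(f, tol=1e-12, max_iter=100_000):
    """Iterate p_{n+1} = f(p_n) from p_0 = 0 until successive values agree."""
    p = 0.0
    for _ in range(max_iter):
        nxt = f(p)
        if abs(nxt - p) < tol:
            return nxt
        p = nxt
    return p

def geometric_pgf(t, p=0.4):
    """Generating function p / (1 - q t) of P[Z = k] = p q^k, mean q/p = 1.5."""
    return p / (1 - (1 - p) * t)

def poisson_pgf(mean):
    """Generating function exp(mean (t - 1)) of a Poisson law."""
    return lambda t: math.exp(mean * (t - 1.0))

ext_geometric = extinction_probability(geometric_pgf)
ext_supercritical = extinction_probability(poisson_pgf(1.1))
ext_subcritical = extinction_probability(poisson_pgf(0.9))
print(ext_geometric, ext_supercritical, ext_subcritical)
```

For mean larger than 1 the limit lies strictly inside $(0,1)$, for mean
smaller than 1 the iteration converges to 1.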


3.4 Lévy’s upward and downward theorems 


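Lemma 3.4.1. Given $X \in \mathcal{L}^1$. Then the class of random variables

$\{Y = E[X|\mathcal{B}] \mid \mathcal{B} \subset \mathcal{A}, \mathcal{B} \text{ is a } \sigma\text{-algebra}\}$


is uniformly integrable.


Proof. Given $\epsilon > 0$. Choose $\delta > 0$ such that for all $A \in \mathcal{A}$, $P[A] < \delta$
implies $E[|X|; A] < \epsilon$. Choose further $K \in \mathbb{R}$ such that
$K^{-1} \cdot E[|X|] < \delta$. By Jensen's inequality, $Y = E[X|\mathcal{B}]$ satisfies


$|Y| = |E[X|\mathcal{B}]| \leq E[|X| \,|\, \mathcal{B}]$ and hence $E[|Y|] \leq E[|X|]$ .


Therefore

$K \cdot P[|Y| > K] \leq E[|Y|] \leq E[|X|] < \delta \cdot K$


so that $P[|Y| > K] < \delta$. Now, by definition of conditional expectation,
$|Y| \leq E[|X| \,|\, \mathcal{B}]$ and $\{|Y| > K\} \in \mathcal{B}$, so that


$E[|Y|; |Y| > K] \leq E[|X|; |Y| > K] < \epsilon$ . □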


Definition. Denote by $\mathcal{A}_\infty$ the $\sigma$-algebra generated by
$\bigcup_n \mathcal{A}_n$.


Theorem 3.4.2 (Lévy's upward theorem). Given $X \in \mathcal{L}^1$. Then
$X_n = E[X|\mathcal{A}_n]$ is a uniformly integrable martingale and $X_n$ converges in
$\mathcal{L}^1$ to $X_\infty = E[X|\mathcal{A}_\infty]$.


Proof. The process X is a martingale. The sequence Xp is uniformly in- 
tegrable by the above lemma. Therefore X,. exists almost everywhere by 
Doob’s convergence theorem for uniformly integrable martingales, and since 
the family X, is uniformly integrable, the convergence is in L). We have 
to show that X. = Y := E[X|Aov0]. 

By proving the claim for the positive and negative part, we can assume 
that X > 0 (and so Y > 0). Consider the two measures 


Qi(A) = E[X; A], Q2(A) = E[Xo0; A] - 


Since E[X.o|An] = E[X|An], we know that Q: and Q2 agree on the 7- 
system U,, An. They agree therefore everywhere on A... Define the event 
A= {E[X|A] > Xoo } € Ago. Since Qi (A) a Q2(A) = E[E[X|Aoo] A 
Xoo}; A] = 0 we have E[X|A~] < Xoo almost everywhere. Similarly also 
Xoo < X|A.] almost everywhere. Oo 


As an application, we see a martingale proof of Kolmogorov’s 0 — 1 law: 


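Corollary 3.4.3. For any sequence $\mathcal{A}_n$ of independent σ-algebras, the tail
σ-algebra $\mathcal{T} = \bigcap_n \mathcal{B}_n$ with $\mathcal{B}_n = \sigma(\bigcup_{m > n} \mathcal{A}_m)$ is trivial.


Proof. Given $A \in \mathcal{T}$, define $X = 1_A \in \mathcal{L}^\infty(\mathcal{T})$ and the σ-algebras
$\mathcal{C}_n = \sigma(\mathcal{A}_1, \dots, \mathcal{A}_n)$. By Lévy’s upward theorem (3.4.2),

$$ X = E[X|\mathcal{C}_\infty] = \lim_{n \to \infty} E[X|\mathcal{C}_n] . $$

But since $\mathcal{C}_n$ is independent of $\mathcal{B}_n$ and $X$ is $\mathcal{B}_n$-measurable, we have

$$ P[A] = E[X] = E[X|\mathcal{C}_n] \to X $$

and because $X$ takes only the values 0 or 1 and $X = P[A]$ shows that it
must be constant, we get $P[A] = 1$ or $P[A] = 0$. □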


Definition. A sequence $\mathcal{A}_{-n}$ of σ-algebras satisfying

$$ \cdots \subset \mathcal{A}_{-n} \subset \mathcal{A}_{-(n-1)} \subset \cdots \subset \mathcal{A}_{-1} $$

is called a downward filtration. Define $\mathcal{A}_{-\infty} = \bigcap_n \mathcal{A}_{-n}$.


Theorem 3.4.4 (Lévy’s downward theorem). Given a downward filtration
$\mathcal{A}_{-n}$ and $X \in \mathcal{L}^1$. Define $X_{-n} = E[X|\mathcal{A}_{-n}]$. Then
$X_{-\infty} = \lim_{n \to \infty} X_{-n}$ converges in $\mathcal{L}^1$ and $X_{-\infty} = E[X|\mathcal{A}_{-\infty}]$.

Proof. Apply Doob’s up-crossing lemma to the uniformly integrable martingale

$$ X_k, \qquad -n \le k \le -1 : $$

for all $a < b$, the number of up-crossings is bounded

$$ U_k[a,b] \le (|a| + ||X||_1)/(b - a) . $$

This implies in the same way as in the proof of Doob’s convergence theorem
that $\lim_{n \to \infty} X_{-n}$ converges almost everywhere.

We show now that $X_{-\infty} = E[X|\mathcal{A}_{-\infty}]$: given $A \in \mathcal{A}_{-\infty}$. We have
$E[X; A] = E[X_{-n}; A] \to E[X_{-\infty}; A]$. The same argument as before shows that
$X_{-\infty} = E[X|\mathcal{A}_{-\infty}]$. □


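Let’s also look at a martingale proof of the strong law of large numbers.


Corollary 3.4.5. Given $X_n \in \mathcal{L}^1$ which are IID and have mean $m$. Then
$S_n/n \to m$ in $\mathcal{L}^1$.


Proof. Define the downward filtration $\mathcal{A}_{-n} = \sigma(S_n, S_{n+1}, \dots)$.

By symmetry, $E[X_1|\mathcal{A}_{-n}] = E[X_i|\mathcal{A}_{-n}] = E[X_i|S_n, S_{n+1}, \dots]$ for
$1 \le i \le n$. Summing over $i$ gives $n E[X_1|\mathcal{A}_{-n}] = E[S_n|\mathcal{A}_{-n}] = S_n$, so that
$E[X_1|\mathcal{A}_{-n}] = S_n/n$. We can apply Lévy’s downward theorem to see that $S_n/n$
converges in $\mathcal{L}^1$. Since the limit $X$ is in $\mathcal{T}$, it is by Kolmogorov’s 0-1 law a
constant $c$ and $c = E[X] = \lim_{n \to \infty} E[S_n/n] = m$. □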


3.5 Doob’s decomposition of a stochastic process 


Definition. A process $X_n$ is increasing, if $P[X_n \le X_{n+1}] = 1$.


Theorem 3.5.1 (Doob’s decomposition). Let $X_n$ be an $\mathcal{A}_n$-adapted $\mathcal{L}^1$-process. Then

$$ X = X_0 + N + A $$

where $N$ is a martingale null at 0 and $A$ is a previsible process null at 0.
This decomposition is unique in $\mathcal{L}^1$. $X$ is a submartingale if and only if $A$
is increasing.


Proof. If $X$ has a Doob decomposition $X = X_0 + N + A$, then

$$ E[X_n - X_{n-1}|\mathcal{A}_{n-1}] = E[N_n - N_{n-1}|\mathcal{A}_{n-1}] + E[A_n - A_{n-1}|\mathcal{A}_{n-1}] = A_n - A_{n-1} $$

which means that

$$ A_n = \sum_{k=1}^n E[X_k - X_{k-1}|\mathcal{A}_{k-1}] . $$

If we define $A$ like this, we get the required decomposition and the
submartingale characterization is also obvious. □
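
The formula $A_n = \sum_{k=1}^n E[X_k - X_{k-1}|\mathcal{A}_{k-1}]$ can be evaluated exactly for processes driven by fair coin flips. The sketch below is an illustration under that assumption (the process is $X_k = f(\text{first } k \text{ flips})$, and the helper name `compensator` is hypothetical); the conditional expectation is just the average over the two possible next flips.

```python
from fractions import Fraction


def compensator(f, steps):
    """Return [A_0, ..., A_n] along the coin-flip path `steps` (entries +1/-1),
    where X_k = f(steps[:k]) and A_k = sum_{j<=k} E[X_j - X_{j-1} | A_{j-1}]."""
    A = [Fraction(0)]
    for k in range(1, len(steps) + 1):
        prefix = tuple(steps[:k - 1])
        # Conditional expectation given the past: average over the next fair flip.
        mean_next = Fraction(f(prefix + (1,)) + f(prefix + (-1,)), 2)
        A.append(A[-1] + mean_next - f(prefix))
    return A


path = (1, -1, -1, 1, 1, 1)
A_square = compensator(lambda s: sum(s) ** 2, path)  # X_n = S_n^2, submartingale
A_walk = compensator(lambda s: sum(s), path)         # X_n = S_n, martingale
A_abs = compensator(lambda s: abs(sum(s)), path)     # X_n = |S_n|, submartingale
```

For $X_n = S_n^2$ the previsible part is $A_n = n$, for the martingale $S_n$ it vanishes, and for $|S_n|$ it counts the visits of 0 before time $n$; in each submartingale case $A$ is increasing.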


Remark. The corresponding result for continuous time processes is deeper 
and called Doob-Meyer decomposition theorem. See theorem (4.17.2). 


Lemma 3.5.2. Given $s, t, u, v \in N$ with $s \le t \le u \le v$. If $X_n$ is a
$\mathcal{L}^2$-martingale, then

$$ E[(X_t - X_s)(X_v - X_u)] = 0 $$

and

$$ E[X_n^2] = E[X_0^2] + \sum_{k=1}^n E[(X_k - X_{k-1})^2] . $$


Proof. Because $E[X_v - X_u|\mathcal{A}_u] = X_u - X_u = 0$, we know that $X_v - X_u$
is orthogonal to $\mathcal{L}^2(\mathcal{A}_u)$. The first claim follows since $X_t - X_s \in \mathcal{L}^2(\mathcal{A}_u)$.
The formula

$$ X_n = X_0 + \sum_{k=1}^n (X_k - X_{k-1}) $$

expresses $X_n$ as a sum of orthogonal terms and Pythagoras theorem gives
the second claim. □


Corollary 3.5.3. A $\mathcal{L}^2$-martingale $X$ is bounded in $\mathcal{L}^2$ if and only if
$\sum_{k=1}^\infty E[(X_k - X_{k-1})^2] < \infty$.


Proof.

$$ E[X_n^2] = E[X_0^2] + \sum_{k=1}^n E[(X_k - X_{k-1})^2] \le E[X_0^2] + \sum_{k=1}^\infty E[(X_k - X_{k-1})^2] < \infty . $$

If on the other hand, $X_n$ is bounded in $\mathcal{L}^2$, then $||X_n||_2 \le K < \infty$ and
$\sum_{k=1}^n E[(X_k - X_{k-1})^2] = E[X_n^2] - E[X_0^2] \le K^2$ for all $n$. □

Theorem 3.5.4 (Doob’s convergence theorem for $\mathcal{L}^2$-martingales). Let $X_n$
be a $\mathcal{L}^2$-martingale which is bounded in $\mathcal{L}^2$, then there exists $X \in \mathcal{L}^2$ such
that $X_n \to X$ in $\mathcal{L}^2$.


Proof. If $X$ is bounded in $\mathcal{L}^2$, then, by monotonicity of the norm $||X||_1 \le ||X||_2$,
it is bounded in $\mathcal{L}^1$ so that by Doob’s convergence theorem, $X_n \to X$
almost everywhere for some $X$. By Pythagoras and the previous corollary (3.5.3), we have

$$ E[(X - X_n)^2] \le \sum_{k \ge n+1} E[(X_k - X_{k-1})^2] \to 0 $$

so that $X_n \to X$ in $\mathcal{L}^2$. □


Definition. Let $X_n$ be a martingale in $\mathcal{L}^2$ which is null at 0. The conditional
Jensen’s inequality (3.1.4) shows that $X_n^2$ is a submartingale. Doob’s
decomposition theorem allows to write $X^2 = N + A$, where $N$ is a martingale
and $A$ is a previsible increasing process. Define $A_\infty = \lim_{n \to \infty} A_n$ pointwise,
where the limit can take the value $\infty$ also. One writes also $\langle X \rangle$ for $A$
so that

$$ X^2 = N + \langle X \rangle . $$


Lemma 3.5.5. Assume $X$ is a $\mathcal{L}^2$-martingale. $X$ is bounded in $\mathcal{L}^2$ if and
only if $E[\langle X \rangle_\infty] < \infty$.


Proof. From $X^2 = N + A$, we get $E[X_n^2] = E[A_n]$ since for a martingale $N$,
the equality $E[N_n] = E[N_0]$ holds and $N$ is null at 0. Therefore, $X$ is bounded
in $\mathcal{L}^2$ if and only if $E[A_\infty] < \infty$, since $E[X_n^2] = E[A_n]$ and $A_n$ is increasing. □


We can now relate the convergence of $X_n$ to the finiteness of $A_\infty = \langle X \rangle_\infty$.

Proposition 3.5.6. a) For almost all $\omega$ with $A_\infty(\omega) < \infty$, the limit
$\lim_{n \to \infty} X_n(\omega)$ exists.

b) On the other hand, if $||X_n - X_{n-1}||_\infty \le K$, then $A_\infty(\omega) < \infty$ for
almost all $\omega$ for which $\lim_{n \to \infty} X_n(\omega)$ exists.

Proof. a) Because $A$ is previsible, we can define for every $k$ a stopping time
$S(k) = \inf\{ n \in N \; | \; A_{n+1} > k \}$. The stopped process $A^{S(k)}$ is previsible
because for $B \in \mathcal{B}_R$ and $n \in N$,

$$ \{ A_{n \wedge S(k)} \in B \} = A_1 \cup A_2 $$

with

$$ A_1 = \bigcup_{i=0}^{n-1} \{ S(k) = i, \ A_i \in B \} \in \mathcal{A}_{n-1} , $$
$$ A_2 = \{ A_n \in B \} \cap \{ S(k) \le n-1 \}^c \in \mathcal{A}_{n-1} . $$

Since

$$ (X^{S(k)})^2 - A^{S(k)} = (X^2 - A)^{S(k)} $$

is a martingale, we see that $\langle X^{S(k)} \rangle = A^{S(k)}$. The latter process $A^{S(k)}$ is
bounded by $k$ so that by the above lemma $X^{S(k)}$ is bounded in $\mathcal{L}^2$. Therefore
$\lim_{n \to \infty} X_{n \wedge S(k)}$ exists almost surely. Combining this with

$$ \{ A_\infty < \infty \} = \bigcup_k \{ S(k) = \infty \} $$

proves the claim.

b) Suppose the claim is wrong and that

$$ P[A_\infty = \infty, \ \sup_n |X_n| < \infty] > 0 . $$

Then, for some $c$,

$$ P[T(c) = \infty; \ A_\infty = \infty] > 0 $$

where $T(c)$ is the stopping time

$$ T(c) = \inf\{ n \; | \; |X_n| > c \} . $$

Now

$$ E[X_{T(c) \wedge n}^2 - A_{T(c) \wedge n}] = 0 $$

and $X^{T(c)}$ is bounded by $c + K$. Thus

$$ E[A_{T(c) \wedge n}] \le (c + K)^2 $$

for all $n$. This is a contradiction to $P[A_\infty = \infty, \ \sup_n |X_n| < \infty] > 0$. □


Theorem 3.5.7 (A strong law for martingales). Let $X$ be a $\mathcal{L}^2$-martingale
zero at 0 and let $A = \langle X \rangle$. Then

$$ \frac{X_n}{A_n} \to 0 $$

almost surely on $\{ A_\infty = \infty \}$.

Proof. (i) Cesàro’s lemma: Given $0 = b_0 < b_1 \le \dots$, $b_n \le b_{n+1} \to \infty$ and a
sequence $v_n \in R$ which converges $v_n \to v_\infty$, then
$\frac{1}{b_n} \sum_{k=1}^n (b_k - b_{k-1}) v_k \to v_\infty$.

Proof. Let $\epsilon > 0$. Choose $m$ such that $v_k > v_\infty - \epsilon$ if $k \ge m$. Then

$$ \liminf_{n \to \infty} \frac{1}{b_n} \sum_{k=1}^n (b_k - b_{k-1}) v_k \ge \liminf_{n \to \infty} \left( \frac{1}{b_n} \sum_{k=1}^m (b_k - b_{k-1}) v_k + \frac{b_n - b_m}{b_n} (v_\infty - \epsilon) \right) \ge 0 + v_\infty - \epsilon . $$

Since this is true for every $\epsilon > 0$, we have $\liminf \ge v_\infty$. By a similar
argument $\limsup \le v_\infty$. □


(ii) Kronecker’s lemma: Given $0 = b_0 < b_1 \le \dots$, $b_n \le b_{n+1} \to \infty$ and a
sequence $x_n$ of real numbers. Define $s_n = x_1 + \cdots + x_n$. Then the convergence
of $u_n = \sum_{k=1}^n x_k/b_k$ implies that $s_n/b_n \to 0$.

Proof. We have $u_n - u_{n-1} = x_n/b_n$ and

$$ s_n = \sum_{k=1}^n b_k (u_k - u_{k-1}) = b_n u_n - \sum_{k=1}^n (b_k - b_{k-1}) u_{k-1} . $$

Cesàro’s lemma (i) implies that $s_n/b_n$ converges to $u_\infty - u_\infty = 0$. □


(iii) Proof of the claim: since $A$ is increasing and null at 0, we have $A_n \ge 0$
and $1/(1 + A_n)$ is bounded. Since $A$ is previsible, also $1/(1 + A_n)$ is previsible,
and we can define the martingale

$$ W_n = \left( \int (1 + A)^{-1} \, dX \right)_n = \sum_{k=1}^n \frac{X_k - X_{k-1}}{1 + A_k} . $$

Moreover, since $(1 + A_n)$ is $\mathcal{A}_{n-1}$-measurable, we have

$$ E[(W_n - W_{n-1})^2|\mathcal{A}_{n-1}] = (1 + A_n)^{-2} (A_n - A_{n-1}) \le (1 + A_{n-1})^{-1} - (1 + A_n)^{-1} $$

almost surely. This implies that $\langle W \rangle_\infty \le 1$ so that $\lim_{n \to \infty} W_n$ exists
almost surely. Kronecker’s lemma (ii) applied pointwise implies that on
$\{ A_\infty = \infty \}$

$$ \lim_{n \to \infty} X_n/(1 + A_n) = \lim_{n \to \infty} X_n/A_n = 0 . $$ □


3.6 Doob’s submartingale inequality


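Theorem 3.6.1 (Doob’s submartingale inequality). Let $X$ be a non-negative
submartingale. For any $\epsilon > 0$

$$ \epsilon \cdot P[ \sup_{1 \le k \le n} X_k \ge \epsilon ] \le E[X_n; \{ \sup_{1 \le k \le n} X_k \ge \epsilon \}] \le E[X_n] . $$


Proof. The set $A = \{ \sup_{1 \le k \le n} X_k \ge \epsilon \}$ is a disjoint union of the sets

$$ A_0 = \{ X_0 \ge \epsilon \} \in \mathcal{A}_0 , $$
$$ A_k = \{ X_k \ge \epsilon \} \cap \Big( \bigcup_{i=0}^{k-1} A_i \Big)^c \in \mathcal{A}_k . $$

Since $X$ is a submartingale, and $X_k \ge \epsilon$ on $A_k$, we have for $k \le n$

$$ E[X_n; A_k] \ge E[X_k; A_k] \ge \epsilon P[A_k] . $$

Summing up from $k = 0$ to $n$ gives the result. □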


We have seen the following result already as part of theorem (2.11.1). Here
it appears as a special case of the submartingale inequality:


Theorem 3.6.2 (Kolmogorov’s inequality). Given $X_n \in \mathcal{L}^2$ IID with
$E[X_i] = 0$ and $S_n = \sum_{k=1}^n X_k$. Then for $\epsilon > 0$,

$$ P[ \sup_{1 \le k \le n} |S_k| \ge \epsilon ] \le \frac{Var[S_n]}{\epsilon^2} . $$


Proof. $S_n$ is a martingale with respect to $\mathcal{A}_n = \sigma(X_1, X_2, \dots, X_n)$. Because
$u(x) = x^2$ is convex, $S_n^2$ is a submartingale. Now apply the submartingale
inequality (3.6.1). □
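
Kolmogorov’s inequality can be checked exactly in a small case. The sketch below assumes fair $\pm 1$ steps (so $Var[S_n] = n$) and enumerates all $2^n$ paths; the function name `max_abs_tail` and the values of `n` and `eps` are illustrative choices, not part of the text.

```python
from itertools import product


def max_abs_tail(n, eps):
    """Exact P[max_{1<=k<=n} |S_k| >= eps] for n fair +1/-1 steps, by enumeration."""
    hits = 0
    for steps in product((1, -1), repeat=n):
        s, peak = 0, 0
        for x in steps:
            s += x
            peak = max(peak, abs(s))
        hits += peak >= eps
    return hits / 2 ** n


n, eps = 10, 5
lhs = max_abs_tail(n, eps)   # left hand side of Kolmogorov's inequality
rhs = n / eps ** 2           # Var[S_n] / eps^2 with Var[S_n] = n
```

Comparing `lhs` with `rhs` shows how much room the inequality leaves for this particular walk.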


Here is another proof of the law of iterated logarithm for independent
N(0,1) random variables.


Theorem 3.6.3 (Special case of law of iterated logarithm). Given $X_n$ IID
with standard normal distribution N(0,1). Then $\limsup_{n \to \infty} S_n/\Lambda(n) = 1$.


Proof. We will use for

$$ 1 - \Phi(x) = \int_x^\infty \phi(y) \, dy = \int_x^\infty (2\pi)^{-1/2} \exp(-y^2/2) \, dy $$

the elementary estimates

$$ (x + x^{-1})^{-1} \phi(x) \le 1 - \Phi(x) \le x^{-1} \phi(x) . $$


(i) $S_n$ is a martingale relative to $\mathcal{A}_n = \sigma(X_1, \dots, X_n)$. The function
$x \mapsto e^{\theta x}$ is convex on $R$ so that $e^{\theta S_n}$ is a submartingale. The submartingale
inequality (3.6.1) gives

$$ P[ \sup_{1 \le k \le n} S_k \ge \epsilon ] = P[ \sup_{1 \le k \le n} e^{\theta S_k} \ge e^{\theta \epsilon} ] \le e^{-\theta \epsilon} E[e^{\theta S_n}] = e^{-\theta \epsilon} e^{\theta^2 n/2} . $$

For given $\epsilon > 0$, we get the best estimate for $\theta = \epsilon/n$ and obtain

$$ P[ \sup_{1 \le k \le n} S_k \ge \epsilon ] \le e^{-\epsilon^2/(2n)} . $$


(ii) Given $K > 1$ (close to 1). Choose $\epsilon_n = K \Lambda(K^{n-1})$. The last inequality
in (i) gives

$$ P[ \sup_{1 \le k \le K^n} S_k \ge \epsilon_n ] \le \exp(-\epsilon_n^2/(2K^n)) = (n-1)^{-K} (\log K)^{-K} . $$

The Borel-Cantelli lemma assures that for large enough $n$ and $K^{n-1} \le k \le K^n$

$$ S_k \le \sup_{1 \le k \le K^n} S_k \le \epsilon_n = K \Lambda(K^{n-1}) \le K \Lambda(k) $$

which means for $K > 1$ almost surely

$$ \limsup_{k \to \infty} \frac{S_k}{\Lambda(k)} \le K . $$

By taking a sequence of $K$’s converging down to 1, we obtain almost surely

$$ \limsup_{k \to \infty} \frac{S_k}{\Lambda(k)} \le 1 . $$


(iii) Given $N > 1$ (large) and $\delta > 0$ (small). Define the independent sets

$$ A_n = \{ S(N^{n+1}) - S(N^n) > (1 - \delta) \Lambda(N^{n+1} - N^n) \} . $$

Then

$$ P[A_n] = 1 - \Phi(y) \ge (2\pi)^{-1/2} (y + y^{-1})^{-1} e^{-y^2/2} $$

with $y = (1 - \delta)(2 \log\log(N^{n+1} - N^n))^{1/2}$. Since $P[A_n]$ is up to logarithmic
terms equal to $(n \log N)^{-(1-\delta)^2}$, we have $\sum_n P[A_n] = \infty$. Borel-Cantelli
shows that $P[\limsup_n A_n] = 1$ so that for infinitely many $n$

$$ S(N^{n+1}) > (1 - \delta) \Lambda(N^{n+1} - N^n) + S(N^n) . $$

By (ii), $S(N^n) > -2\Lambda(N^n)$ for large $n$ so that for infinitely many $n$, we
have

$$ S(N^{n+1}) > (1 - \delta) \Lambda(N^{n+1} - N^n) - 2\Lambda(N^n) . $$

It follows that

$$ \limsup_{n \to \infty} \frac{S_n}{\Lambda(n)} \ge \limsup_{n \to \infty} \frac{S(N^{n+1})}{\Lambda(N^{n+1})} \ge (1 - \delta) \Big( 1 - \frac{1}{N} \Big)^{1/2} - \frac{2}{N^{1/2}} . $$ □


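3.7 Doob’s $\mathcal{L}^p$ inequality


Lemma 3.7.1. (Corollary of Hölder inequality) Fix $p > 1$ and $q$ satisfying
$p^{-1} + q^{-1} = 1$. Given $X, Y \in \mathcal{L}^p$ satisfying

$$ \epsilon P[|X| \ge \epsilon] \le E[|Y|; |X| \ge \epsilon] $$

for all $\epsilon > 0$, then $||X||_p \le q \cdot ||Y||_p$.


Proof. Integrating the assumption multiplied with $p \epsilon^{p-2}$ gives

$$ L = \int_0^\infty p \epsilon^{p-1} P[|X| \ge \epsilon] \, d\epsilon \le \int_0^\infty p \epsilon^{p-2} E[|Y|; |X| \ge \epsilon] \, d\epsilon =: R . $$

By Fubini’s theorem, the left hand side is

$$ L = \int_0^\infty E[p \epsilon^{p-1} 1_{\{|X| \ge \epsilon\}}] \, d\epsilon = E\Big[ \int_0^{|X|} p \epsilon^{p-1} \, d\epsilon \Big] = E[|X|^p] . $$

Similarly, the right hand side is $R = E[q \cdot |X|^{p-1} |Y|]$. With Hölder’s
inequality, we get

$$ E[|X|^p] \le E[q |X|^{p-1} |Y|] \le q \, ||Y||_p \cdot || \, |X|^{p-1} ||_q . $$

Since $(p - 1)q = p$, we can substitute $|| \, |X|^{p-1} ||_q = E[|X|^p]^{1/q}$ on the right
hand side, which gives the claim. □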


Theorem 3.7.2 (Doob’s $\mathcal{L}^p$ inequality). Given a non-negative submartingale
$X$ which is bounded in $\mathcal{L}^p$. Then $X^* = \sup_n X_n$ is in $\mathcal{L}^p$ and satisfies

$$ ||X^*||_p \le q \cdot \sup_n ||X_n||_p . $$

Proof. Define $X_n^* = \sup_{1 \le k \le n} X_k$ for $n \in N$. From Doob’s submartingale
inequality (3.6.1) and the above lemma (3.7.1), we see that

$$ ||X_n^*||_p \le q \, ||X_n||_p \le q \sup_n ||X_n||_p . $$ □


Corollary 3.7.3. Given a non-negative submartingale $X$ which is bounded
in $\mathcal{L}^p$. Then $X_\infty = \lim_{n \to \infty} X_n$ exists in $\mathcal{L}^p$ and $||X_\infty||_p = \sup_n ||X_n||_p$.


Proof. The submartingale $X$ is dominated by the element $X^*$ in the $\mathcal{L}^p$-inequality.
The supermartingale $-X$ is bounded in $\mathcal{L}^p$ and so bounded in
$\mathcal{L}^1$. We know therefore that $X_\infty = \lim_{n \to \infty} X_n$ exists almost everywhere.
From $|X_n - X_\infty|^p \le (2X^*)^p \in \mathcal{L}^1$ and the dominated convergence theorem
we deduce $X_n \to X_\infty$ in $\mathcal{L}^p$. □


Corollary 3.7.4. Given a martingale $Y$ bounded in $\mathcal{L}^p$ and $X = |Y|$. Then

$$ X_\infty = \lim_{n \to \infty} X_n $$

exists in $\mathcal{L}^p$ and $||X_\infty||_p = \sup_n ||X_n||_p$.


Proof. Use the above corollary for the submartingale $X = |Y|$. □


Theorem 3.7.5 (Kakutani’s theorem). Let $X_n$ be a non-negative independent
$\mathcal{L}^1$ process with $E[X_n] = 1$ for all $n$. Define $S_0 = 1$ and $S_n = \prod_{k=1}^n X_k$.
Then $S_\infty = \lim_n S_n$ exists, because $S_n$ is a nonnegative $\mathcal{L}^1$ martingale.
Then $S_n$ is uniformly integrable if and only if $\prod_{n=1}^\infty E[X_n^{1/2}] > 0$.


Proof. Define $a_n = E[X_n^{1/2}]$. The process

$$ T_n = \frac{X_1^{1/2}}{a_1} \frac{X_2^{1/2}}{a_2} \cdots \frac{X_n^{1/2}}{a_n} $$

is a martingale. Assume first that $\prod_n a_n > 0$. We have
$E[T_n^2] = (a_1 a_2 \cdots a_n)^{-2} \le (\prod_n a_n)^{-2} < \infty$ so that
$T$ is bounded in $\mathcal{L}^2$. By Doob’s $\mathcal{L}^2$-inequality

$$ E[\sup_n |S_n|] \le E[\sup_n |T_n|^2] \le 4 \sup_n E[|T_n|^2] < \infty $$

so that $S$ is dominated by $S^* = \sup_n |S_n| \in \mathcal{L}^1$. This implies that $S$ is
uniformly integrable.


If $S_n$ is uniformly integrable, then $S_n \to S_\infty$ in $\mathcal{L}^1$. We have to show that
$\prod_n a_n > 0$. Aiming for a contradiction, we assume that $\prod_n a_n = 0$. The
martingale $T$ defined above is a nonnegative martingale which has a limit
$T_\infty$. But since $\prod_n a_n = 0$ we must then have that $S_\infty = 0$ and so $S_n \to 0$
in $\mathcal{L}^1$. This is not possible because $E[S_n] = 1$ by the independence of the
$X_n$. □


Here are examples, where martingales occur in applications: 


Example. This example is a primitive model for the Stock and Bond market.
Given $a < r < b < \infty$ real numbers. Define $p = (r - a)/(b - a)$. Let $\epsilon_n$
be IID random variables taking values $1, -1$ with probability $p$ respectively
$1 - p$. Define a process $B_n$ (bonds with fixed interest rate $r$) and $S_n$ (stocks
with fluctuating interest rates) by

$$ B_n = (1 + r) B_{n-1}, \qquad B_0 = 1 $$

$$ S_n = (1 + R_n) S_{n-1}, \qquad S_0 = 1 $$

with $R_n = (a + b)/2 + \epsilon_n (b - a)/2$. Given a sequence $A_n$ (the portfolio),
your fortune is $X_n$ and satisfies

$$ X_n = (1 + r) X_{n-1} + A_n S_{n-1} (R_n - r) . $$

We can write $R_n - r = \frac{1}{2}(b - a)(Z_n - Z_{n-1})$ with the martingale

$$ Z_n = \sum_{k=1}^n (\epsilon_k - 2p + 1) . $$

The process $Y_n = (1 + r)^{-n} X_n$ satisfies then

$$ Y_n - Y_{n-1} = (1 + r)^{-n} A_n S_{n-1} (R_n - r) $$
$$ = \frac{1}{2} (b - a) (1 + r)^{-n} A_n S_{n-1} (Z_n - Z_{n-1}) $$
$$ = C_n (Z_n - Z_{n-1}) $$

showing us that $Y$ is the stochastic integral $\int C \, dZ$. So, if your portfolio
$A_n$ is previsible ($\mathcal{A}_{n-1}$ measurable), then $Y$ is a martingale.


Example. Let $X, X_1, X_2, \dots$ be independent random variables satisfying
that the law of $X$ is $N(0, \sigma^2)$ and the law of $X_k$ is $N(0, \sigma_k^2)$. We define the
random variables

$$ Y_k = X + X_k $$

which we consider as a noisy observation of the random variable $X$. Define
$\mathcal{A}_n = \sigma(Y_1, \dots, Y_n)$ and the martingale

$$ M_n = E[X|\mathcal{A}_n] . $$

By Doob’s martingale convergence theorem (3.5.4), we know that $M_n$ converges
in $\mathcal{L}^2$ to a random variable $M_\infty$. One can show that

$$ E[(X - M_n)^2] = \Big( \sigma^{-2} + \sum_{k=1}^n \sigma_k^{-2} \Big)^{-1} . $$

This implies that $X = M_\infty$ if and only if $\sum_n \sigma_n^{-2} = \infty$. If the noise grows
too much, for example for $\sigma_n = n$, then we can not recover $X$ from the
observations $Y_k$.
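
The error formula makes the dichotomy easy to see numerically. The following sketch simply evaluates $(\sigma^{-2} + \sum_{k=1}^n \sigma_k^{-2})^{-1}$ for two noise profiles; the function name `estimation_error` and the choice $\sigma = 1$ are assumptions made for illustration.

```python
def estimation_error(sigma, noise_sigmas):
    """E[(X - M_n)^2] = 1 / (sigma^-2 + sum_k sigma_k^-2) for the given noise levels."""
    precision = sigma ** -2 + sum(s ** -2 for s in noise_sigmas)
    return 1.0 / precision


sample_sizes = (1, 10, 1000)

# Constant noise sigma_k = 1: the error 1/(1+n) tends to 0, X is recovered.
err_const = [estimation_error(1.0, [1.0] * n) for n in sample_sizes]

# Growing noise sigma_k = k: sum 1/k^2 converges, the error stays bounded away from 0.
err_grow = [estimation_error(1.0, range(1, n + 1)) for n in sample_sizes]
```

With $\sigma_k = k$ the error approaches $1/(1 + \pi^2/6)$, which is about 0.378, instead of 0.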


3.8 Random walks 


Consider the d-dimensional lattice $Z^d$, where each point has $2d$ neighbors.
A particle starts at the origin $0 \in Z^d$ and makes in each time step a random
step into one of the $2d$ directions. What is the probability that the particle
returns back to the origin?


Definition. Define a sequence of IID random variables $X_n$ which take values
in

$$ I = \{ e \in Z^d \; | \; |e| = \sum_{i=1}^d |e_i| = 1 \} $$

and which have the uniform distribution defined by $P[X_n = e] = (2d)^{-1}$
for all $e \in I$. The random variable $S_n = \sum_{i=1}^n X_i$ with $S_0 = 0$ describes
the position of the particle at time $n$. The discrete stochastic process $S_n$ is
called the random walk on the lattice $Z^d$.


Figure. A random walk sample path $S_1(\omega), \dots, S_n(\omega)$ in the lattice $Z^2$
after 2000 steps. $B_n(\omega)$ is the number of revisits of the starting point 0.


As a probability space, we can take $\Omega = I^N$ with product measure $\nu^N$,
where $\nu$ is the measure on $I$, which assigns to each point $e$ the probability
$\nu(\{e\}) = (2d)^{-1}$. The random variables $X_n$ are then defined by $X_n(\omega) = \omega_n$.
Define the sets $A_n = \{ S_n = 0 \}$ and the random variables

$$ Y_n = 1_{A_n} . $$

If the particle has returned to position $0 \in Z^d$ at time $n$, then $Y_n = 1$,
otherwise $Y_n = 0$. The sum $B_n = \sum_{k=0}^n Y_k$ counts the number of visits of
the origin 0 of the particle up to time $n$ and $B = \sum_{k=0}^\infty Y_k$ counts the total
number of visits at the origin. The expectation

$$ E[B] = \sum_{n=0}^\infty P[S_n = 0] $$

tells us how many times a particle is expected to return to the origin. We
write $E[B] = \infty$, if the sum diverges. In this case, the particle returns back
to the origin infinitely many times.


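Theorem 3.8.1 (Polya). $E[B] = \infty$ for $d = 1, 2$ and $E[B] < \infty$ for $d > 2$.


Proof. Fix $n \in N$ and define $a^{(n)}(k) = P[S_n = k]$ for $k \in Z^d$. Because
the particle can reach in time $n$ only a bounded region, the function
$a^{(n)} : Z^d \to R$ is zero outside a bounded set. We can therefore define its Fourier
transform

$$ \phi_{S_n}(x) = \sum_{k \in Z^d} a^{(n)}(k) e^{2\pi i k \cdot x} $$

which is a smooth function on $T^d = R^d/Z^d$. It is the characteristic function
of $S_n$ because

$$ E[e^{2\pi i x \cdot S_n}] = \sum_{k \in Z^d} P[S_n = k] e^{2\pi i k \cdot x} . $$

The characteristic function $\phi_X$ of $X_k$ is

$$ \phi_X(x) = \frac{1}{2d} \sum_{|j| = 1} e^{2\pi i x \cdot j} = \frac{1}{d} \sum_{i=1}^d \cos(2\pi x_i) . $$

Because $S_n$ is a sum of $n$ independent random variables $X_j$,

$$ \phi_{S_n}(x) = \phi_{X_1}(x) \phi_{X_2}(x) \cdots \phi_{X_n}(x) = \frac{1}{d^n} \Big( \sum_{i=1}^d \cos(2\pi x_i) \Big)^n . $$

Note that the zeroth Fourier coefficient $\hat{\phi}_{S_n}(0) = \int_{T^d} \phi_{S_n}(x) \, dx$ is $P[S_n = 0]$.


We now show that $E[B] = \sum_{n \ge 0} \hat{\phi}_{S_n}(0)$ is finite if and only if $d \ge 3$. The
Fourier inversion formula gives

$$ \sum_{n=0}^\infty P[S_n = 0] = \int_{T^d} \sum_{n=0}^\infty \phi_X^n(x) \, dx = \int_{T^d} \frac{1}{1 - \phi_X(x)} \, dx . $$

A Taylor expansion $\phi_X(x) = 1 - \frac{(2\pi)^2}{2d} \sum_j x_j^2 + \dots$ shows that near $x = 0$

$$ \frac{1}{2} \cdot \frac{(2\pi)^2}{2d} |x|^2 \le 1 - \phi_X(x) \le 2 \cdot \frac{(2\pi)^2}{2d} |x|^2 . $$

The claim of the theorem follows because the integral

$$ \int_{\{ |x| < \epsilon \}} \frac{1}{|x|^2} \, dx $$

over the ball of radius $\epsilon$ in $R^d$ is finite if and only if $d \ge 3$. □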


Corollary 3.8.2. The particle returns to the origin infinitely often almost
surely if $d \le 2$. For $d \ge 3$, almost surely, the particle returns only finitely
many times to zero and $P[\lim_{n \to \infty} |S_n| = \infty] = 1$.


Proof. If $d > 2$, then $A_\infty = \limsup_n A_n$ is the subset of $\Omega$, for which the
particle returns to 0 infinitely many times. Since $E[B] = \sum_{n=0}^\infty P[A_n] < \infty$,
the Borel-Cantelli lemma gives $P[A_\infty] = 0$ for $d > 2$. The particle returns
therefore back to 0 only finitely many times and in the same way it visits
each lattice point only finitely many times. This means that the particle
eventually leaves every bounded set and converges to infinity.


If $d \le 2$, let $p$ be the probability that the random walk returns to 0:

$$ p = P[\bigcup_{n \ge 1} A_n] . $$

Then $p^{m-1}$ is the probability that there are at least $m$ visits in 0 and the
probability is $p^{m-1} - p^m = p^{m-1}(1 - p)$ that there are exactly $m$ visits. We
can write

$$ E[B] = \sum_{m \ge 1} m p^{m-1} (1 - p) = \frac{1}{1 - p} . $$

Because $E[B] = \infty$, we know that $p = 1$. □


The use of characteristic functions also allows to solve combinatorial problems
like counting the number of closed paths starting at zero in the graph:


Proposition 3.8.3. There are

$$ \int_0^1 \cdots \int_0^1 \Big( \sum_{k=1}^d 2 \cos(2\pi x_k) \Big)^n \, dx_1 \cdots dx_d $$

closed paths of length $n$ in the lattice $Z^d$.

Proof. If we know the probability $P[S_n = 0]$ that a path returns to 0 in $n$
steps, then $(2d)^n P[S_n = 0]$ is the number of closed paths in $Z^d$ of length $n$.
But $P[S_n = 0]$ is the zeroth Fourier coefficient

$$ \int_{T^d} \phi_{S_n}(x) \, dx = \int_{T^d} \Big( \frac{1}{d} \sum_{k=1}^d \cos(2\pi x_k) \Big)^n \, dx $$

of $\phi_{S_n}$. □
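
The integral formula can be compared with a direct count. The sketch below, whose function names are illustrative assumptions, counts closed paths by dynamic programming and evaluates the integral by an equally spaced Riemann sum. For a trigonometric polynomial of degree $n$ such a sum with more than $n$ nodes per coordinate is exact up to rounding, which is why a small grid suffices.

```python
from itertools import product
from math import comb, cos, pi


def closed_paths_dp(d, n):
    """Count length-n nearest-neighbour paths in Z^d from 0 back to 0."""
    counts = {(0,) * d: 1}
    for _ in range(n):
        nxt = {}
        for pos, c in counts.items():
            for i in range(d):
                for step in (1, -1):
                    q = pos[:i] + (pos[i] + step,) + pos[i + 1:]
                    nxt[q] = nxt.get(q, 0) + c
        counts = nxt
    return counts.get((0,) * d, 0)


def closed_paths_integral(d, n):
    """Riemann sum with n+1 nodes per axis for the integral of
    (sum_k 2 cos(2 pi x_k))^n over the unit cube [0,1]^d."""
    m = n + 1
    total = 0.0
    for idx in product(range(m), repeat=d):
        total += sum(2 * cos(2 * pi * j / m) for j in idx) ** n
    return total / m ** d


count_1d = closed_paths_dp(1, 6)   # equals comb(6, 3)
count_2d = closed_paths_dp(2, 4)
count_3d = closed_paths_dp(3, 4)
```

Both functions should agree up to floating point error, and in dimension one the count reduces to the central binomial coefficient.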


Example. In the case $d = 1$, we have

$$ \int_0^1 2^{2n} \cos^{2n}(2\pi x) \, dx = \binom{2n}{n} $$

closed paths of length $2n$ starting at 0. We know that also because

$$ P[S_{2n} = 0] = \binom{2n}{n} \frac{1}{2^{2n}} . $$


The lattice $Z^d$ can be generalized to an arbitrary graph $G$ which is a regular
graph, that is a graph where each vertex has the same number of neighbors.
A convenient way is to take as the graph the Cayley graph of a discrete
group $G$ with generators $a_1, \dots, a_d$.


Corollary 3.8.4. If $G$ is the Cayley graph of an Abelian group $G$ then the
random walk on $G$ is recurrent if and only if at most two of the generators
have infinite order.


Proof. An Abelian group $G$ is isomorphic to $Z^k \times Z_{n_1} \times \dots \times Z_{n_d}$. The
characteristic function of $X_n$ is a function on the dual group $\hat{G}$.

$$ \sum_{n=0}^\infty P[S_n = 0] = \sum_{n=0}^\infty \int_{\hat{G}} \phi_{S_n}(x) \, dx = \int_{\hat{G}} \sum_{n=0}^\infty \phi_X^n(x) \, dx = \int_{\hat{G}} \frac{1}{1 - \phi_X(x)} \, dx $$

is finite if and only if $\hat{G}$ contains a three dimensional torus which means
$k > 2$. □


The recurrence properties on non-Abelian groups are more subtle, because
characteristic functions then lose some of their good properties.


Example. Another generalization is to add a drift by changing the probability
distribution $\nu$ on $I$. Given $p_j \in (0,1)$ with $\sum_{|j|=1} p_j = 1$. In this
case

$$ \phi_X(x) = \sum_{|j|=1} p_j e^{2\pi i x \cdot j} . $$

We have recurrence if and only if

$$ \int_{T^d} \frac{1}{1 - \phi_X(x)} \, dx = \infty . $$

Take for example the case $d = 1$ with drift parameterized by $p \in (0,1)$.
Then

$$ \phi_X(x) = p e^{2\pi i x} + (1 - p) e^{-2\pi i x} = \cos(2\pi x) + i(2p - 1) \sin(2\pi x) $$

which shows that

$$ \int_{T^1} \frac{1}{1 - \phi_X(x)} \, dx < \infty $$

if $p \ne 1/2$. A random walk with drift will almost certainly not return to 0
infinitely often.


Example. Another generalization of the random walk is to take identically
distributed random variables $X_n$ with values in $I$, which need not be
independent. An example which appears in number theory in the case $d = 1$
is to take the probability space $\Omega = T^1 = R/Z$, an irrational number $\alpha$ and
a function $f$ which takes each value in $I$ on an interval $[\frac{k}{2d}, \frac{k+1}{2d})$. The
random variables $X_n(\omega) = f(\omega + n\alpha)$ define an ergodic discrete stochastic
process but the random variables are not independent. A random walk
$S_n = \sum_{k=1}^n X_k$ with random variables $X_k$ which are dependent is called a
dependent random walk.


Figure. If $Y_k$ are IID random variables with uniform distribution in $[0, a]$,
then $Z_n = \sum_{k=1}^n Y_k \bmod 1$ are dependent. Define $X_k = (1, 0)$ if $Z_k \in [0, 1/4)$,
$X_k = (-1, 0)$ if $Z_k \in [1/4, 1/2)$, $X_k = (0, 1)$ if $Z_k \in [1/2, 3/4)$ and
$X_k = (0, -1)$ if $Z_k \in [3/4, 1)$. Also the $X_k$ are no longer independent. For
small $a$, there can be long intervals, where $X_k$ is the same because $Z_k$ stays
in the same quarter interval. The picture shows a typical path of
the process $S_n = \sum_{k=1}^n X_k$.


Example. An example of a one-dimensional dependent random walk is the
problem of "almost alternating sums" [52]. Define on the probability space
$\Omega = ([0,1], \mathcal{A}, dx)$ the random variables $X_n(x) = 2 \cdot 1_{[0,1/2)}(x + n\alpha) - 1$,
where $\alpha$ is an irrational number. This produces a symmetric random walk,
but unlike for the usual random walk, where $S_n(x)$ grows like $\sqrt{n}$, one sees
a much slower growth $S_n(0) \le \log(n)^2$ for almost all $\alpha$ and for special
numbers like the golden ratio $(\sqrt{5} + 1)/2$ or the silver ratio $\sqrt{2} + 1$ one has
for infinitely many $n$ the relation

$$ a \cdot \log(n) + 0.78 \le S_n(0) \le a \cdot \log(n) + 1 $$

with $a = 1/(2 \log(1 + \sqrt{2}))$. It is not known whether $S_n(0)$ grows like $\log(n)$
for almost all $\alpha$.


Figure. An almost periodic random walk in one dimension. Instead of
flipping coins to decide whether to go up or down, one turns a wheel by an
angle $\alpha$ after each step and goes up if the wheel position is in the right half
and goes down if the wheel position is in the left half. While for periodic
$\alpha$ the growth of $S_n$ is either linear (like for $\alpha = 0$), or zero (like
for $\alpha = 1/2$), the growth for most irrational $\alpha$ seems to be logarithmic.


3.9 The arc-sin law for the 1D random walk 


Definition. Let $X_n$ denote independent $\{-1, 1\}$-valued random variables with
$P[X_n = \pm 1] = 1/2$ and let $S_n = \sum_{k=1}^n X_k$ be the random walk. We have seen that
it is a martingale with respect to $X_n$. Given $a \in Z$, we define the stopping
time

$$ T_a = \min\{ n \in N \; | \; S_n = a \} . $$


Theorem 3.9.1 (Reflection principle). For integers $a, b > 0$, one has

$$ P[a + S_n = b, \ T_{-a} \le n] = P[S_n = a + b] . $$


Proof. The number of paths from $a$ to $b$ passing zero is equal to the number
of paths from $-a$ to $b$ which in turn is the number of paths from zero to
$a + b$. □

Figure. The proof of the reflection principle: reflect the part of the path
above 0 at the line 0. To every path which goes from $a$ to $b$ and touches 0
there corresponds a path from $-a$ to $b$.
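
For short walks the reflection principle can be confirmed by listing all paths. The following sketch (the helper name `reflection_counts` is hypothetical) counts, among the $2^n$ paths of length $n$, those belonging to the left and to the right hand side of the identity.

```python
from itertools import product


def reflection_counts(n, a, b):
    """Count length-n +1/-1 paths for both sides of the reflection principle."""
    left = right = 0
    for steps in product((1, -1), repeat=n):
        partial, s = [], 0
        for x in steps:
            s += x
            partial.append(s)
        # Left side: a + S_n = b and the walk hits -a by time n (T_{-a} <= n).
        if a + s == b and -a in partial:
            left += 1
        # Right side: S_n = a + b.
        if s == a + b:
            right += 1
    return left, right


example = reflection_counts(7, 1, 2)
```

Dividing both counts by $2^n$ gives the two probabilities in the theorem.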


The reflection principle allows us to compute the distribution of the random
variable $T_{-a}$:


Theorem 3.9.2 (Ruin time). We have the following distribution of the stopping time:

a) $P[T_{-a} \le n] = P[S_n \le -a] + P[S_n > a]$.

b) $P[T_{-a} = n] = \frac{a}{n} P[S_n = a]$.


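Proof. a) Use the reflection principle in the third equality:

$$ P[T_{-a} \le n] = \sum_{b \in Z} P[T_{-a} \le n, \ a + S_n = b] $$
$$ = \sum_{b \le 0} P[a + S_n = b] + \sum_{b > 0} P[T_{-a} \le n, \ a + S_n = b] $$
$$ = \sum_{b \le 0} P[a + S_n = b] + \sum_{b > 0} P[S_n = a + b] $$
$$ = P[S_n \le -a] + P[S_n > a] . $$

b) From

$$ P[S_n = a] = \binom{n}{\frac{a+n}{2}} 2^{-n} $$

we get

$$ \frac{a}{n} P[S_n = a] = \frac{1}{2} (P[S_{n-1} = a - 1] - P[S_{n-1} = a + 1]) . $$

Also

$$ P[S_n > a] - P[S_{n-1} > a] = P[S_n > a, \ S_{n-1} \le a] + P[S_n > a, \ S_{n-1} > a] - P[S_{n-1} > a] $$
$$ = \frac{1}{2} (P[S_{n-1} = a] - P[S_{n-1} = a + 1]) $$

and analogously

$$ P[S_n \le -a] - P[S_{n-1} \le -a] = \frac{1}{2} (P[S_{n-1} = a - 1] - P[S_{n-1} = a]) . $$

Therefore, using a)

$$ P[T_{-a} = n] = P[T_{-a} \le n] - P[T_{-a} \le n - 1] $$
$$ = P[S_n \le -a] - P[S_{n-1} \le -a] + P[S_n > a] - P[S_{n-1} > a] $$
$$ = \frac{1}{2} (P[S_{n-1} = a] - P[S_{n-1} = a + 1]) + \frac{1}{2} (P[S_{n-1} = a - 1] - P[S_{n-1} = a]) $$
$$ = \frac{1}{2} (P[S_{n-1} = a - 1] - P[S_{n-1} = a + 1]) = \frac{a}{n} P[S_n = a] . $$ □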
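
Part b) of the ruin time theorem can be tested with exact rational arithmetic. In the sketch below, `first_hit_probability` enumerates all paths and `ruin_formula` evaluates $\frac{a}{n} P[S_n = a]$; both names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product
from math import comb


def first_hit_probability(n, a):
    """Exact P[T_{-a} = n]: the walk reaches -a for the first time at time n."""
    hits = 0
    for steps in product((1, -1), repeat=n):
        s, first = 0, None
        for k, x in enumerate(steps, start=1):
            s += x
            if s == -a:
                first = k
                break
        hits += first == n
    return Fraction(hits, 2 ** n)


def ruin_formula(n, a):
    """(a/n) * P[S_n = a] with P[S_n = a] = C(n, (n+a)/2) / 2^n."""
    if (n + a) % 2 or a > n:
        return Fraction(0)
    return Fraction(a, n) * Fraction(comb(n, (n + a) // 2), 2 ** n)


p_enumerated = first_hit_probability(3, 1)
p_formula = ruin_formula(3, 1)
```

For $n = 3$, $a = 1$ the only path with $T_{-1} = 3$ is $(+1, -1, -1)$, so both values are $1/8$.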
Theorem 3.9.3 (Ballot theorem).

$$ P[S_n = a, \ S_1 > 0, \dots, S_{n-1} > 0] = \frac{a}{n} \cdot P[S_n = a] . $$

Proof. When reversing time, the number of paths from 0 to $a$ of length $n$
which do no more hit 0 is the number of paths of length $n$ which start in
$a$ and for which $T_{-a} = n$. Now use the previous theorem

$$ P[T_{-a} = n] = \frac{a}{n} P[S_n = a] . $$ □

Corollary 3.9.4. The distribution of the first return time is

$$ P[T_0 > 2n] = P[S_{2n} = 0] . $$

Proof.

$$ P[T_0 > 2n] = \frac{1}{2} P[T_{-1} > 2n - 1] + \frac{1}{2} P[T_1 > 2n - 1] $$
$$ = P[T_{-1} > 2n - 1] \quad \text{(by symmetry)} $$
$$ = P[S_{2n-1} > -1 \text{ and } S_{2n-1} \le 1] $$
$$ = P[S_{2n-1} \in \{0, 1\}] $$
$$ = P[S_{2n-1} = 1] = P[S_{2n} = 0] . $$ □

Remark. We see that $\lim_{n \to \infty} P[T_0 > 2n] = 0$. This restates that the
random walk is recurrent. However, the expected return time is very long:

$$ E[T_0] = \sum_{n=0}^\infty n P[T_0 = n] = \sum_{n=0}^\infty P[T_0 > n] \ge \sum_{n=0}^\infty P[S_{2n} = 0] = \infty $$

because by the Stirling formula $n! \sim n^n e^{-n} \sqrt{2\pi n}$, one has
$\binom{2n}{n} \sim 2^{2n}/\sqrt{\pi n}$ and so

$$ P[S_{2n} = 0] = \binom{2n}{n} \frac{1}{2^{2n}} \sim (\pi n)^{-1/2} . $$
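
The identity $P[T_0 > 2n] = P[S_{2n} = 0]$ used in the corollary and the remark can be verified for small $n$ by enumeration. This is a sketch with a hypothetical helper name; it compares the number of paths of length $2n$ which avoid 0 with $\binom{2n}{n}$, the number of paths ending at 0.

```python
from itertools import product
from math import comb


def no_return_count(n):
    """Number of +1/-1 paths of length 2n with S_k != 0 for 1 <= k <= 2n."""
    count = 0
    for steps in product((1, -1), repeat=2 * n):
        s, returned = 0, False
        for x in steps:
            s += x
            if s == 0:
                returned = True
                break
        count += not returned
    return count


avoiding = [no_return_count(n) for n in range(1, 7)]
ending_at_zero = [comb(2 * n, n) for n in range(1, 7)]
```

Both lists coincide, which is the statement of the corollary after dividing by $2^{2n}$.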
Definition. We are interested now in the random variable

$$ L(\omega) = \max\{ 0 \le n \le 2N \; | \; S_n(\omega) = 0 \} $$

which describes the last visit of the random walk in 0 before time $2N$. If
the random walk describes a game between two players, who play over a
time $2N$, then $L$ is the time after which one of the two players no longer
gives up his leadership.


Theorem 3.9.5 (Arc Sin law). $L$ has the discrete arc-sin distribution:

$$ P[L = 2n] = \frac{1}{2^{2N}} \binom{2n}{n} \binom{2N - 2n}{N - n} $$

and for $N \to \infty$, we have

$$ P[\frac{L}{2N} \le z] \to \frac{2}{\pi} \arcsin(\sqrt{z}) . $$


Proof.

$$ P[L = 2n] = P[S_{2n} = 0] \cdot P[T_0 > 2N - 2n] = P[S_{2n} = 0] \cdot P[S_{2N-2n} = 0] $$

which gives the first formula. The Stirling formula gives $P[S_{2k} = 0] \sim \frac{1}{\sqrt{\pi k}}$
so that

$$ P[L = 2k] \sim \frac{1}{\pi} \frac{1}{\sqrt{k(N - k)}} = \frac{1}{N} f\Big( \frac{k}{N} \Big) $$

with

$$ f(x) = \frac{1}{\pi \sqrt{x(1 - x)}} . $$

It follows that

$$ P[\frac{L}{2N} \le z] \to \int_0^z f(x) \, dx = \frac{2}{\pi} \arcsin(\sqrt{z}) . $$ □
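
To see how quickly the discrete arc-sin distribution approaches its limit, one can tabulate the exact probabilities. The sketch below uses exact fractions for $P[L = 2n]$; the function names and the choice $N = 400$ are assumptions made for illustration.

```python
from fractions import Fraction
from math import asin, comb, pi, sqrt


def last_visit_pmf(N):
    """P[L = 2n] for n = 0..N according to the discrete arc-sin distribution."""
    return [Fraction(comb(2 * n, n) * comb(2 * N - 2 * n, N - n), 4 ** N)
            for n in range(N + 1)]


def last_visit_cdf(N, z):
    """P[L / (2N) <= z] computed from the exact probabilities."""
    return float(sum(p for n, p in enumerate(last_visit_pmf(N)) if n <= z * N))


N = 400
pmf = last_visit_pmf(N)
# Largest deviation from the limit (2/pi) arcsin(sqrt(z)) at a few test points.
gap = max(abs(last_visit_cdf(N, z) - 2 / pi * asin(sqrt(z)))
          for z in (0.1, 0.25, 0.5, 0.75, 0.9))
```

The probabilities sum to one exactly, and already for moderate $N$ the distribution function stays close to the limiting arc-sin law.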


Figure. The distribution function $P[L/2N \le z]$ converges in the limit
$N \to \infty$ to the function $2 \arcsin(\sqrt{z})/\pi$.

Figure. The density function of this distribution in the limit $N \to \infty$ is
called the arc-sin distribution.


Remark. From the shape of the arc-sin distribution, one has to expect that
the winner takes the final leading position either early or late.


Remark. The arc-sin distribution is a natural distribution on the interval
$[0,1]$ from several points of view. It belongs to a measure which is
the Gibbs measure of the quadratic map $x \mapsto 4 \cdot x(1 - x)$ on the unit
interval maximizing the Boltzmann-Gibbs entropy. It is a thermodynamic
equilibrium measure for this quadratic map. It is the measure $\mu$ on the
interval $[0,1]$ which minimizes the energy

$$ I(\mu) = - \int_0^1 \int_0^1 \log|E - E'| \, d\mu(E) \, d\mu(E') . $$

One calls such measures also potential theoretical equilibrium measures.


3.10 The random walk on the free group 


Definition. The free group $F_d$ with $d$ generators is the set of finite words
$w$ written in the $2d$ letters

$$ A = \{ a_1, a_2, \dots, a_d, a_1^{-1}, a_2^{-1}, \dots, a_d^{-1} \} $$

modulo the identifications $a_i a_i^{-1} = a_i^{-1} a_i = 1$. The group operation is
concatenating words $v \circ w = vw$. The inverse of $w = w_1 w_2 \cdots w_n$ is
$w^{-1} = w_n^{-1} \cdots w_2^{-1} w_1^{-1}$. Elements $w$ in the group $F_d$ can be uniquely represented
by reduced words obtained by deleting all words $v v^{-1}$ in $w$. The identity
$e$ in the group $F_d$ is the empty word. We denote by $l(w)$ the length of the
reduced word of $w$.


Definition. Given a free group $G$ with generators $A$ and let $X_k$ be uniformly
distributed random variables with values in $A$. The stochastic process
$S_n = X_1 \cdots X_n$ is called the random walk on the group $G$. Note that the group
operation need not be commutative. The random walk on the free
group can be interpreted as a walk on a tree, because the Cayley graph of
the group $F_d$ with generators $A$ contains no non-contractible closed circles.


Figure. Part of the Cayley graph of the free group $F_2$ with two generators
$a, b$. It is a tree. At every point, one can go into 4 different directions.
Going into one of these directions corresponds to multiplying with
$a, a^{-1}, b$ or $b^{-1}$.


Definition. Define for $n \in N$

$$ r_n = P[S_n = e, \ S_1 \ne e, \ S_2 \ne e, \dots, S_{n-1} \ne e] $$

which is the probability of returning for the first time to $e$ at time $n$ if one
starts at $e$. Define also for $n \in N$

$$ m_n = P[S_n = e] $$

with the convention $m_0 = 1$. Let $r$ and $m$ be the probability generating
functions of the sequences $r_n$ and $m_n$:

$$ m(x) = \sum_{n=0}^\infty m_n x^n, \qquad r(x) = \sum_{n=0}^\infty r_n x^n . $$

These sums converge for $|x| < 1$.


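Lemma 3.10.1. (Feller)

$$ m(x) = \frac{1}{1 - r(x)} . $$

Proof. Let $T$ be the stopping time

$$ T = \min\{ n \in N \; | \; S_n = e \} . $$

With $P[T = n] = r_n$, the function $r(x) = \sum_n r_n x^n$ is the probability
generating function of $T$. The probability generating function of a sum of
independent random variables is the product of the probability generating
functions. Therefore, if $T_i$ are independent random variables with the
distribution of $T$, then $\sum_{i=1}^k T_i$ has the probability generating function
$x \mapsto r^k(x)$. We have

$$ \sum_{n=0}^\infty m_n x^n = \sum_{n=0}^\infty P[S_n = e] x^n $$
$$ = \sum_{n=0}^\infty \sum_{k} \sum_{0 < n_1 < n_2 < \dots < n_k = n} P[S_{n_1} = e, \ S_{n_2} = e, \dots, S_{n_k} = e, \ S_j \ne e \text{ for } j \notin \{ n_1, \dots, n_k \}] x^n $$
$$ = \sum_{n=0}^\infty \sum_{k=0}^\infty P[\sum_{i=1}^k T_i = n] x^n = \sum_{k=0}^\infty r^k(x) = \frac{1}{1 - r(x)} . $$ □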


Remark. This lemma is true for the random walk on a Cayley graph of any 
finitely presented group. 


The numbers $r_{2n+1}$ are zero because an even number of steps
is needed to come back. The values of $r_{2n}$ can be computed by using basic
combinatorics:


Lemma 3.10.2. (Kesten)

$$ r_{2n} = \frac{1}{(2d)^{2n}} \frac{1}{n} \binom{2n - 2}{n - 1} 2d (2d - 1)^{n-1} . $$


Proof. We have

$$ r_{2n} = \frac{1}{(2d)^{2n}} |\{ w_1 w_2 \dots w_{2n} = e, \ w^k = w_1 w_2 \dots w_k \ne e \text{ for } 0 < k < 2n \}| . $$

To count the number of such words, map every word with $2n$ letters into
a path in $Z^2$ going from $(0,0)$ to $(n,n)$ which is away from the diagonal
except at the beginning or the end. The map is constructed in the following
way: for every letter, we record a horizontal or vertical step of length 1.
If $l(w^k) = l(w^{k-1}) + 1$, we record a horizontal step. In the other case, if
$l(w^k) = l(w^{k-1}) - 1$, we record a vertical step. The first step is horizontal
independent of the word. There are

$$ \frac{1}{n} \binom{2n - 2}{n - 1} $$

such paths since by the distribution of the stopping time in the one dimensional
random walk, the number of such paths is

$$ 2^{2n-1} P[T_{-1} = 2n - 1] = 2^{2n-1} \frac{1}{2n - 1} P[S_{2n-1} = 1] $$
$$ = \frac{1}{2n - 1} \binom{2n - 1}{n} $$
$$ = \frac{1}{n} \binom{2n - 2}{n - 1} . $$

Counting the number of words mapped into the same path, we see that we
have in the first step $2d$ possibilities and later $(2d - 1)$ possibilities in each
of the $n - 1$ horizontal steps and only 1 possibility in a vertical step. We
have therefore to multiply the number of paths by $2d(2d - 1)^{n-1}$. □


Theorem 3.10.3 (Kesten). For the free group $F_d$, we have

$$ m(x) = \frac{2d-1}{(d-1) + \sqrt{d^2 - (2d-1) x^2}} . $$

Proof. Since we know $r_{2n}$ we can compute

$$ r(x) = \frac{d - \sqrt{d^2 - (2d-1) x^2}}{2d-1} $$

and get the claim with Feller's lemma $m(x) = 1/(1 - r(x))$. □
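As a sanity check on the two lemmas, the following sketch computes $r_{2n}$ from Kesten's formula and then obtains $m_n$ from Feller's relation $m(x) = 1/(1 - r(x))$, rewritten as the recursion $m_n = \sum_{k=1}^{n} r_k m_{n-k}$ with $m_0 = 1$. The function names are illustrative and not part of the text. For $d = 1$ the walk is the simple random walk on $\mathbb{Z}$, where $m_{2n} = \binom{2n}{n} 4^{-n}$ is known independently.

```python
from fractions import Fraction
from math import comb


def first_return_probs(d, n_max):
    """r[2n] from Kesten's formula on the free group F_d; odd entries stay 0."""
    r = [Fraction(0)] * (2 * n_max + 1)
    for n in range(1, n_max + 1):
        paths = Fraction(comb(2 * n - 2, n - 1), n)     # (1/n) * C(2n-2, n-1)
        words = 2 * d * (2 * d - 1) ** (n - 1)          # words mapped to one path
        r[2 * n] = paths * words / Fraction(2 * d) ** (2 * n)
    return r


def return_probs(r):
    """m[n] from m(x) = 1/(1 - r(x)): m_n = sum_{k=1}^{n} r_k m_{n-k}, m_0 = 1."""
    m = [Fraction(1)]
    for n in range(1, len(r)):
        m.append(sum(r[k] * m[n - k] for k in range(1, n + 1)))
    return m


m_z = return_probs(first_return_probs(1, 6))    # d = 1: random walk on Z
m_f2 = return_probs(first_return_probs(2, 6))   # d = 2: free group F_2
print(m_z[2], m_z[4], m_f2[2], m_f2[4])
```

For $d = 2$ the value $m_4$ can also be counted by hand: a closed walk of length 4 in the 4-regular tree either returns to $e$ after two steps (16 walks) or goes out two steps and back (12 walks), which gives $28/4^4$.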


Remark. The Cayley graph of the free group is also called the Bethe lattice.
One can read off from this formula that the spectrum of the free Laplacian
$L : l^2(F_d) \to l^2(F_d)$ on the Bethe lattice given by

$$ Lu(g) = \sum_{a \in A} u(g + a) $$

is the whole interval $[-a, a]$ with $a = 2\sqrt{2d-1}$.


Corollary 3.10.4. The random walk on the free group $F_d$ with $d$ generators
is recurrent if and only if $d = 1$.

Proof. Denote as in the case of the random walk on $\mathbb{Z}^d$ with $B$ the
random variable counting the total number of visits of the origin. We have then
again $E[B] = \sum_n P[S_n = e] = \sum_n m_n = m(1)$. We see that for $d = 1$ we
have $m(1) = \infty$ and that $m(1) < \infty$ for $d > 1$. This establishes the
analog of Polya's result on $\mathbb{Z}^d$ and leads in the same way to the
recurrence statement:

(i) $d = 1$: We know that $\mathbb{Z}^1 = F_1$, and that the walk in $\mathbb{Z}^1$ is recurrent.
(ii) $d \geq 2$: define the event $A_n = \{ S_n = e \}$. Then $A_\infty = \limsup_n A_n$ is the
subset of $\Omega$ for which the walk returns to $e$ infinitely many times. Since
for $d \geq 2$,

$$ E[B] = \sum_{n=0}^{\infty} P[A_n] = m(1) < \infty , $$

the Borel-Cantelli lemma gives $P[A_\infty] = 0$ for $d \geq 2$. The particle returns
therefore to $e$ only finitely many times and similarly it visits each vertex in
$F_d$ only finitely many times. This means that the particle eventually leaves
every bounded set and escapes to infinity. □


Remark. We could say that the problem of the random walk on a discrete 
group G is solvable if one can give an algebraic formula for the function 
m(x). We have seen that the classes of Abelian finitely generated and free 
groups are solvable. Trying to extend the class of solvable random walks 
seems to be an interesting problem. It would also be interesting to know, 
whether there exists a group such that the function $m(x)$ is transcendental.


3.11 The free Laplacian on a discrete group 


Definition. Let $G$ be a countable discrete group and $A \subset G$ a finite set
which generates $G$. The Cayley graph $\Gamma$ of $(G, A)$ is the graph with sites
$G$ and edges $(i, j)$ satisfying $i - j \in A$ or $j - i \in A$.


Remark. We write the composition in $G$ additively even though we do not
assume that $G$ is Abelian. We allow $A$ to contain also the identity $e \in G$.
In this case, the Cayley graph contains two closed loops of length 1 at each
site.


Definition. The symmetric random walk on $\Gamma(G, A)$ is the process obtained
by summing up independent uniformly distributed $(A \cup A^{-1})$-valued random
variables $X_n$. More generally, we can allow the random variables $X_n$
to be independent but have any distribution on $A \cup A^{-1}$. This distribution
is given by numbers $p_a = p_{a^{-1}} \in [0, 1]$ satisfying $\sum_{a \in A \cup A^{-1}} p_a = 1$.

Definition. The free Laplacian for the random walk given by $(G, A, p)$ is
the linear operator on $l^2(G)$ defined by

$$ L_{gh} = p_{g-h} . $$

Since we assumed $p_a = p_{a^{-1}}$, the matrix $L$ is symmetric: $L_{gh} = L_{hg}$ and
the spectrum

$$ \sigma(L) = \{ E \in \mathbb{C} \mid (L - E) \text{ is not invertible} \} $$

is a compact subset of the real line.

Remark. One can interpret $L$ as the transition probability matrix of the
random walk which is a "Markov chain". We will come back to this
interpretation later.


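Example. $G = \mathbb{Z}$, $A = \{1\}$, $p = p_a = 1/2$ for $a = 1, -1$ and $p_a = 0$ for
$a \notin \{1, -1\}$. The matrix

$$ L = \begin{pmatrix}
\ddots & \ddots &   &   &        \\
\ddots & 0      & p &   &        \\
       & p      & 0 & p &        \\
       &        & p & 0 & \ddots \\
       &        &   & \ddots & \ddots
\end{pmatrix} $$

is also called a Jacobi matrix. It acts on the Hilbert space $l^2(\mathbb{Z})$ by
$(Lu)_n = p(u_{n+1} + u_{n-1})$.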


Example. Let $G = D_3$ be the dihedral group which has the presentation
$G = \langle a, b \mid a^3 = b^2 = (ab)^2 = 1 \rangle$. The group is the symmetry group of the
equilateral triangle. It has 6 elements and it is the smallest non-Abelian
group. Let us number the group elements with integers $\{ 1, 2 = a, 3 = a^2,
4 = b, 5 = ab, 6 = a^2 b \}$. We have for example $3 \star 4 = a^2 b = 6$ or
$3 \star 5 = a^2 ab = a^3 b = b = 4$. In this case $A = \{a, b\}$, $A^{-1} = \{a^{-1}, b\}$ so that
$A \cup A^{-1} = \{a, a^{-1}, b\}$. The Cayley graph of the group is a graph with 6
vertices. We could take the uniform distribution $p_a = p_b = p_{a^{-1}} = 1/3$ on
$A \cup A^{-1}$, but let's instead choose the distribution $p_a = p_{a^{-1}} = 1/4$, $p_b = 1/2$,
which is natural if we consider multiplication by $b$ and multiplication by
$b^{-1}$ as different.
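The group and its Cayley graph can be built explicitly. The sketch below is one possible realisation, not taken from the text: it represents $D_3$ by permutations of three points, with $a$ a rotation and $b$ a reflection, lists the elements in the order $1, a, a^2, b, ab, a^2b$, and fills in $L_{gh}$ with the weights $p_a = p_{a^{-1}} = 1/4$, $p_b = 1/2$. Reading the additive expression $g - h$ as $h^{-1}g$ in the non-Abelian group is an assumption made here.

```python
from fractions import Fraction


def compose(p, q):
    """Product of permutations: (p * q)(i) = p[q[i]]."""
    return tuple(p[i] for i in q)


def inverse(p):
    inv = [0] * len(p)
    for i, image in enumerate(p):
        inv[image] = i
    return tuple(inv)


e = (0, 1, 2)
a = (1, 2, 0)                      # rotation, a^3 = e
b = (1, 0, 2)                      # reflection, b^2 = e
a2 = compose(a, a)
elements = [e, a, a2, b, compose(a, b), compose(a2, b)]   # numbered 1..6

weights = {a: Fraction(1, 4), inverse(a): Fraction(1, 4), b: Fraction(1, 2)}

# L[g][h] = p_{h^{-1} g}, and 0 if h^{-1} g is not in A u A^{-1}.
L = [[weights.get(compose(inverse(h), g), Fraction(0)) for h in elements]
     for g in elements]

# Undirected edges of the Cayley graph: pairs with a positive weight.
edges = {frozenset((i, j)) for i in range(6) for j in range(6) if L[i][j] > 0}
print(len(elements), len(edges))
```

With this convention the matrix is symmetric, every row sums to 1, and the graph has 6 vertices and 9 edges.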


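Example. The free Laplacian on $D_3$ with the random walk transition
probabilities $p_a = p_{a^{-1}} = 1/4$, $p_b = 1/2$ is the matrix

$$ L = \begin{pmatrix}
0   & 1/4 & 1/4 & 1/2 & 0   & 0   \\
1/4 & 0   & 1/4 & 0   & 1/2 & 0   \\
1/4 & 1/4 & 0   & 0   & 0   & 1/2 \\
1/2 & 0   & 0   & 0   & 1/4 & 1/4 \\
0   & 1/2 & 0   & 1/4 & 0   & 1/4 \\
0   & 0   & 1/2 & 1/4 & 1/4 & 0
\end{pmatrix} $$

which has the eigenvalues $1, 0, 1/4, 1/4, -3/4, -3/4$. (In block form $L$ has
the $3 \times 3$ block $M$ with zero diagonal and off-diagonal entries $1/4$ on the
diagonal and $\frac{1}{2} \cdot 1$ off the diagonal, so its eigenvalues are those of
$M \pm \frac{1}{2}$.)

Figure. The Cayley graph of the dihedral group $G = D_3$ is a regular graph
with 6 vertices and 9 edges.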


A basic question is: what is the relation between the spectrum of $L$, the
structure of the group $G$ and the properties of the random walk on $G$?

Definition. As before, let $m_n$ be the probability that the random walk
starting in $e$ returns in $n$ steps to $e$ and let

$$ m(x) = \sum_{n=0}^{\infty} m_n x^n $$

be the generating function of the sequence $m_n$.


Proposition 3.11.1. The norm of $L$ is equal to $\limsup_{n \to \infty} (m_n)^{1/n}$, the
inverse of the radius of convergence of $m(x)$.

Proof. Because $L$ is symmetric and real, it is self-adjoint and the spectrum
of $L$ is a subset of the real line $\mathbb{R}$ and the spectral radius of $L$ is
equal to its norm $||L||$.
We have $[L^n]_{ee} = m_n$ since $[L^n]_{ee}$ is the sum of products $\prod_{j=1}^{n} p_{a_j}$, each
of which is the probability that a specific path of length $n$ starting and
landing at $e$ occurs.
It remains therefore to verify that

$$ \limsup_{n \to \infty} ||L^n||^{1/n} = \limsup_{n \to \infty} \, [L^n]_{ee}^{1/n} $$

and since the $\geq$ direction is trivial we have only to show the $\leq$ direction.
Denote by $E(\lambda)$ the spectral projection matrix of $L$, so that $dE(\lambda)$ is a
projection-valued measure on the spectrum and the spectral theorem says
that $L$ can be written as $L = \int \lambda \, dE(\lambda)$. The measure $\mu_e = dE_{ee}$ is called
a spectral measure of $L$. The real number $E(\lambda) - E(\mu)$ is nonzero if and
only if there exists some spectrum of $L$ in the interval $[\lambda, \mu)$. Since, with
$dk = \mu_e$,

$$ - \sum_{n=0}^{\infty} \frac{[L^n]_{ee}}{\lambda^{n+1}} = \int_{\mathbb{R}} (E - \lambda)^{-1} \, dk(E) $$

can't be analytic in $\lambda$ in a point $\lambda_0$ of the support of $dk$, which is the
spectrum of $L$, the claim follows. □


Remark. We have seen that the matrix $L$ defines a spectral measure $\mu_e$ on
the real line. It can be defined for any group element $g$, not only $g = e$, and
is the same measure. It is therefore also the so called density of states of $L$.
If we think of $\mu$ as playing the role of the law for random variables, then
the integrated density of states $E(\lambda) = F_\mu(\lambda) = \int_{-\infty}^{\lambda} d\mu$ plays the role
of the distribution function for real-valued random variables.


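Example. The Fourier transform $U : l^2(\mathbb{Z}^1) \to L^2(\mathbb{T}^1)$:

$$ \hat{u}(x) = (Uu)(x) = \sum_{n \in \mathbb{Z}} u_n e^{inx} $$

diagonalises the matrix $L$ for the random walk on $\mathbb{Z}^1$:

$$ \begin{aligned}
(U L U^*) \hat{u}(x) &= (UL)(u)(x) = p \, U(u_{n+1} + u_{n-1})(x) \\
 &= p \sum_{n \in \mathbb{Z}} (u_{n+1} + u_{n-1}) e^{inx} \\
 &= p \sum_{n \in \mathbb{Z}} u_n \left( e^{i(n-1)x} + e^{i(n+1)x} \right) \\
 &= p \sum_{n \in \mathbb{Z}} u_n \left( e^{ix} + e^{-ix} \right) e^{inx} \\
 &= p \sum_{n \in \mathbb{Z}} u_n \, 2 \cos(x) \, e^{inx} \\
 &= 2p \cos(x) \cdot \hat{u}(x) .
\end{aligned} $$

This shows that for $p = 1/2$ the spectrum of $ULU^*$ is $[-1, 1]$ and because $U$
is a unitary transformation, also the spectrum of $L$ is $[-1, 1]$.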


Example. Let $G = \mathbb{Z}^d$ and $A = \{ e_i \}_{i=1}^{d}$, where $\{ e_i \}$ is the standard basis.
Assume $p = p_a = 1/(2d)$. The analogous Fourier transform $F : l^2(\mathbb{Z}^d) \to
L^2(\mathbb{T}^d)$ shows that $F L F^*$ is the multiplication with $\frac{1}{d} \sum_{j=1}^{d} \cos(x_j)$. The
spectrum is again the interval $[-1, 1]$.


Example. The Fourier diagonalisation works for any discrete Abelian group 
with finitely many generators. 


Example. $G = F_d$, the free group with the natural $d$ generators. The
spectrum of $L$ is

$$ \left[ - \frac{\sqrt{2d-1}}{d} , \ \frac{\sqrt{2d-1}}{d} \right] $$

which is strictly contained in $[-1, 1]$ if $d > 1$.
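The relation between the norm of $L$ and the growth of the return probabilities can be observed numerically. The sketch below is an illustration under stated assumptions, not a proof: it computes $m_{2n}$ on $F_d$ in floating point from Kesten's first-return formula and Feller's recursion, and uses the ratio $m_{2n+2}/m_{2n}$, which should approach the square $(2d-1)/d^2$ of the spectral radius. The ratio is used instead of $(m_{2n})^{1/(2n)}$ because it converges faster; the helper names and the cut-off $n = 300$ are arbitrary choices.

```python
import math


def return_probabilities(d, n_max):
    """Floating-point list [m_0, m_2, ..., m_{2 n_max}] on the free group F_d."""
    # r[n] holds r_{2n}; successive values follow from Kesten's formula
    # using the Catalan ratio C_n / C_{n-1} = 2(2n-1)/(n+1).
    r = [0.0, 1.0 / (2 * d)]
    for n in range(1, n_max):
        catalan_ratio = 2.0 * (2 * n - 1) / (n + 1)
        r.append(r[n] * catalan_ratio * (2 * d - 1) / (2 * d) ** 2)
    m = [1.0]
    for n in range(1, n_max + 1):
        m.append(sum(r[k] * m[n - k] for k in range(1, n + 1)))
    return m


def growth_ratio(d, n=300):
    """Ratio m_{2n+2} / m_{2n}, an estimate of the squared spectral radius."""
    m = return_probabilities(d, n + 1)
    return m[n + 1] / m[n]


for d in (1, 2, 3):
    estimate = math.sqrt(growth_ratio(d))
    exact = math.sqrt(2 * d - 1) / d
    print(d, round(estimate, 4), round(exact, 4))
```

For $d = 1$ the estimate tends to 1, in agreement with the spectrum $[-1, 1]$ on $\mathbb{Z}$; for $d > 1$ it stays strictly below 1.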


Remark. Kesten has shown that the spectral radius of $L$ is equal to 1 if and
only if the group $G$ has an invariant mean. For example, for a finite graph,
where $L$ is a stochastic matrix, for which each column is a probability
vector, the spectral radius is 1 because $L^T$ has the eigenvector $(1, \dots, 1)$
with eigenvalue 1.

Random walks and Laplacians can be defined on any graph. The spectrum
of the Laplacian on a finite graph is an invariant of the graph but there are
non-isomorphic graphs with the same spectrum. There are known infinite
self-similar graphs, for which the Laplacian has pure point spectrum [63].
There are also known infinite graphs, such that the Laplacian has purely
singular continuous spectrum [95]. For more on spectral theory on graphs,
start with [6].


3.12 A discrete Feynman-Kac formula 


Definition. A discrete Schrödinger operator is a bounded linear operator
$L$ on the Hilbert space $l^2(\mathbb{Z}^d)$ of the form

$$ (Lu)(n) = \sum_{i=1}^{d} \big( u(n + e_i) - 2u(n) + u(n - e_i) \big) + V(n) u(n) , $$

where $V$ is a bounded function on $\mathbb{Z}^d$. They are discrete versions of
operators $L = -\Delta + V(x)$ on $L^2(\mathbb{R}^d)$, where $\Delta$ is the free Laplacian. Such
operators are also called Jacobi matrices.


Definition. The Schrödinger equation

$$ i \hbar \dot{u} = Lu , \qquad u(0) = u_0 $$

is a differential equation in $l^2(\mathbb{Z}^d, \mathbb{C})$ which describes the motion of a
complex valued wave function $u$ of a classical quantum mechanical system. The
constant $\hbar$ is called the Planck constant and $i = \sqrt{-1}$ is the imaginary
unit. Let's assume to have units where $\hbar = 1$ for simplicity.


Remark. The solution of the Schrödinger equation is

$$ u_t = e^{\frac{t}{i} L} u_0 . $$

The solution exists for all times because the von Neumann series

$$ e^{\frac{t}{i} L} = 1 + \frac{tL}{i} + \frac{t^2 L^2}{2! \, i^2} + \frac{t^3 L^3}{3! \, i^3} + \cdots $$

is in the space of bounded operators.


Remark. It is an achievement of the physicist Richard Feynman to see
the evolution as a path integral. In the case of differential operators
$L$, this idea can be made rigorous by going to imaginary time and
one can write for $L = -\Delta + V$

$$ e^{-tL} u(x) = E_x \left[ e^{-\int_0^t V(\gamma(s)) \, ds} \, u_0(\gamma(t)) \right] , $$

where $E_x$ is the expectation value with respect to the measure $P_x$ on the
Wiener space of Brownian motion starting at $x$.

Here is a discrete version of the Feynman-Kac formula:

Definition. The Schrödinger equation with discrete time is defined as

$$ i (u_{t+\epsilon} - u_t) = \epsilon L u_t , $$

where $\epsilon > 0$ is fixed. We get the evolution

$$ u_{t + n\epsilon} = (1 - i \epsilon L)^n u_t $$

and we denote the right hand side with $\tilde{L}^n u_t$.
Definition. Denote by $\Gamma_n(i, j)$ the set of paths of length $n$ from $i$ to $j$ in
the graph $G$ having as sites $\mathbb{Z}^d$ and as edges the pairs $[i, j]$ with
$|i - j| \leq 1$. The graph $G$ is the Cayley graph of the group $\mathbb{Z}^d$ with the
generators $A \cup A^{-1} \cup \{e\}$, where $A = \{ e_1, \dots, e_d \}$ is the set of natural
generators and where $e$ is the identity.

Definition. Given a path $\gamma$ of finite length $n$, we use the notation

$$ \exp\Big( \int_\gamma L \Big) = \prod_{i=1}^{n} L_{\gamma(i-1), \gamma(i)} . $$

Let $\Gamma$ be the set of all paths on $G$ and let $E$ denote the expectation with
respect to a measure $P$ of the random walk on $G$ starting at 0.


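Theorem 3.12.1 (Discrete Feynman-Kac formula). Given a discrete
Schrödinger operator $L$. Then

$$ (L^n u)(0) = E_0 \Big[ \exp\Big( \int_0^n L \Big) \, u(\gamma(n)) \Big] . $$

Proof.

$$ \begin{aligned}
(L^n u)(0) &= \sum_{j} (L^n)_{0j} \, u(j) \\
 &= \sum_{j} \ \sum_{\gamma \in \Gamma_n(0, j)} \exp\Big( \int_\gamma L \Big) \, u(j) \\
 &= \sum_{\gamma \in \Gamma_n} \exp\Big( \int_\gamma L \Big) \, u(\gamma(n)) ,
\end{aligned} $$

where $\Gamma_n$ is the set of all paths of length $n$ starting at 0. □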


Remark. This discrete random walk expansion corresponds to the
Feynman-Kac formula in the continuum. If we extend the potential to all the
edges of the Cayley graph by putting $V([k, k]) = V(k)$ and $V([k, l]) = 1$ for
$k \neq l$, we can define $\exp(\int_\gamma V)$ as the product $\prod_{i=1}^{n} V([\gamma(i-1), \gamma(i)])$. Then

$$ (L^n u)(0) = E \Big[ \exp\Big( \int_0^n V \Big) \, u(\gamma(n)) \Big] , $$

which is formally the Feynman-Kac formula.
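The path expansion can be tested directly in one dimension. The sketch below is an illustration under stated assumptions: it takes $d = 1$, a made-up potential `V` and a made-up finitely supported function `u0`, and reads the "expectation" as a plain sum over all $3^n$ paths of length $n$ starting at 0 (steps $-1, 0, +1$), each weighted by the product of the matrix entries of $L$ along the path. It compares this sum with $(L^n u)(0)$ computed by applying $L$ repeatedly.

```python
from itertools import product


def V(k):
    """A hypothetical bounded potential on Z, chosen only for illustration."""
    return 1.0 / (1 + k * k)


def entry(i, j):
    """Matrix entry L_{ij} of (Lu)(n) = u(n+1) - 2u(n) + u(n-1) + V(n)u(n)."""
    if i == j:
        return V(i) - 2.0
    return 1.0 if abs(i - j) == 1 else 0.0


def apply_L(u):
    """Apply L to a finitely supported function given as a dict site -> value."""
    sites = {k + s for k in u for s in (-1, 0, 1)}
    return {n: sum(entry(n, k) * u.get(k, 0.0) for k in (n - 1, n, n + 1))
            for n in sites}


def path_sum(u, n):
    """Sum over all paths of length n from 0 of (product of L entries) * u(end)."""
    total = 0.0
    for steps in product((-1, 0, 1), repeat=n):
        pos, weight = 0, 1.0
        for s in steps:
            weight *= entry(pos, pos + s)
            pos += s
        total += weight * u.get(pos, 0.0)
    return total


u0 = {k: 1.0 / (1 + abs(k)) for k in range(-3, 4)}   # hypothetical test function
n = 5
v = dict(u0)
for _ in range(n):
    v = apply_L(v)
print(v[0], path_sum(u0, n))
```

Both numbers agree up to rounding, which is the content of the proof: expanding the matrix power $(L^n)_{0j}$ entry by entry produces exactly one term per path.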


In order to compute $(\tilde{L}^n u)(k)$ with $\tilde{L} = (1 - i \epsilon L)$, we have to take the
potential $\tilde{V}$ defined by

$$ \tilde{V}([k, k]) = 1 - i \epsilon V(k) . $$


Remark. The Schrödinger equation with discrete time has the disadvantage
that the time evolution of the quantum mechanical system is no longer
unitary. This drawback could be overcome by considering also $i \hbar (u_t -
u_{t-\epsilon}) = \epsilon L u_t$, so that the propagator from $u_{t-\epsilon}$ to $u_{t+\epsilon}$ is given by the
unitary operator

$$ U = \Big( 1 - \frac{i \epsilon}{\hbar} L \Big) \Big( 1 + \frac{i \epsilon}{\hbar} L \Big)^{-1} $$

which is a Cayley transform of $L$. See also [50], where the idea is discussed
to use $\tilde{L} = \arccos(aL)$, where $L$ has been rescaled such that $aL$ has norm
smaller or equal to 1. The time evolution can then be computed by iterating
the map $A : (\psi, \phi) \mapsto (2aL\psi - \phi, \psi)$ on $H \oplus H$.


3.13 Discrete Dirichlet problem 


Also for other partial differential equations, solutions can be described prob- 
abilistically. We look here at the Dirichlet problem in a bounded discrete 
region. The formula which we derive in this situation holds also in the 
continuum limit, where the random walk is replaced by Brownian motion. 


Definition. The discrete Laplacian on $\mathbb{Z}^2$ is defined as

$$ \Delta f(n, m) = f(n+1, m) + f(n-1, m) + f(n, m+1) + f(n, m-1) - 4 f(n, m) . $$

With the discrete partial derivatives

$$ \delta_x^+ f(n, m) = f(n+1, m) - f(n, m) , \qquad \delta_x^- f(n, m) = f(n, m) - f(n-1, m) , $$

$$ \delta_y^+ f(n, m) = f(n, m+1) - f(n, m) , \qquad \delta_y^- f(n, m) = f(n, m) - f(n, m-1) , $$

the Laplacian is the sum of the second derivatives as in the continuous case,
where $\Delta f = f_{xx} + f_{yy}$:

$$ \Delta = \delta_x^+ \delta_x^- + \delta_y^+ \delta_y^- . $$

The discrete Laplacian in $\mathbb{Z}^3$ is defined in the same way as a discretisation
of $\Delta f = f_{xx} + f_{yy} + f_{zz}$. The setup is analogous in higher dimensions:

$$ (\Delta u)(n) = \sum_{i=1}^{d} \big( u(n + e_i) + u(n - e_i) - 2u(n) \big) , $$

where $e_1, \dots, e_d$ is the standard basis in $\mathbb{Z}^d$.


Definition. A bounded region $D$ in $\mathbb{Z}^d$ is a finite subset of $\mathbb{Z}^d$. Two points
are connected in $D$ if they are connected in $\mathbb{Z}^d$. The boundary $\delta D$ of $D$
consists of all lattice points in $D$ which have a neighboring lattice point
which is outside $D$. Given a function $f$ on the boundary $\delta D$, the discrete
Dirichlet problem asks for a function $u$ on $D$ which satisfies the discrete
Laplace equation $\Delta u = 0$ in the interior $\mathrm{int}(D)$ and for which $u = f$ on the
boundary $\delta D$.


Figure. The discrete Dirichlet problem is a problem in linear algebra. One
algorithm to solve the problem can be restated as a probabilistic "path
integral method". To find the value of $u$ at a point $x$, look at the "discrete
Wiener space" of all paths $\gamma$ starting at $x$ and ending at some boundary
point $S_T(\omega) \in \delta D$ of $D$. The solution is $u(x) = E_x[f(S_T)]$.

Definition. Let $\Omega_{x,n}$ denote the set of all paths of length $n$ in $D$ which start
at a point $x \in D$ and end up at a point in the boundary $\delta D$. It is a subset
of $\Gamma_{x,n}$, the set of all paths of length $n$ in $\mathbb{Z}^d$ starting at $x$. Let's call it
the discrete Wiener space of order $n$ defined by $x$ and $D$. It is a subset of the
set $\Gamma_{x,n}$ which has $(2d)^n$ elements. We take the uniform distribution on this
finite set so that $P_{x,n}[\{\gamma\}] = 1/(2d)^n$.

Definition. Let $L$ be the matrix for which $L_{x,y} = 1/(2d)$ if $x, y \in \mathbb{Z}^d$ are
connected by a path and $x$ is in the interior of $D$. The matrix $L$ is a bounded
linear operator on $l^2(D)$ and satisfies $L_{x,z} = L_{z,x}$ for $x, z \in \mathrm{int}(D) = D \setminus \delta D$.
Given $f : \delta D \to \mathbb{R}$, we extend $f$ to a function $F(x) = 0$ on $\mathrm{int}(D) = D \setminus \delta D$
and $F(x) = f(x)$ for $x \in \delta D$. The discrete Dirichlet problem can be restated
as the problem to find the solution $u$ to the system of linear equations

$$ (1 - L) u = f . $$


Lemma 3.13.1. The number of paths in $\Omega_{x,n}$ starting at $x \in D$ and ending
at a different point $y \in D$ is equal to $(2d)^n L^n_{x,y}$.

Proof. Use induction. By definition, $L_{x,z}$ is $1/(2d)$ if there is a path from $x$
to $z$. The integer $(2d)^n L^n_{x,y}$ is the number of paths of length $n$ from $x$ to $y$. □


Figure. Here is an example of a problem where $D \subset \mathbb{Z}^2$ has 10 points,
shown together with the corresponding $10 \times 10$ matrix. Only the rows
corresponding to interior points are nonzero.


Definition. For a function $f$ on the boundary $\delta D$, define

$$ E_{x,n}[f] = \sum_{y \in \delta D} f(y) \, L^n_{x,y} $$

and

$$ E_x[f] = \sum_{n=0}^{\infty} E_{x,n}[f] . $$

This functional defines for every point $x \in D$ a probability measure $\mu_x$ on
the boundary $\delta D$. It is the discrete analog of the harmonic measure in the
continuum. The measure $P_x$ on the set of paths satisfies $E_x[1] = 1$ as we
will just see.


Proposition 3.13.2. Let $S_n$ be the random walk on $\mathbb{Z}^d$ and let $T$ be the
stopping time which is the first exit time of $S$ from $D$. The solution to the
discrete Dirichlet problem is

$$ u(x) = E_x[f(S_T)] . $$

Proof. Because $(1 - L) u = f$ and

$$ E_{x,n}[f] = (L^n f)_x , $$

we have from the geometric series formula

$$ (1 - A)^{-1} = \sum_{k=0}^{\infty} A^k $$

the result

$$ u(x) = (1 - L)^{-1} f(x) = \sum_{n=0}^{\infty} [L^n f]_x = \sum_{n=0}^{\infty} E_{x,n}[f] = E_x[f(S_T)] . $$

Define the matrix $K$ by $K_{jj} = 1$ for $j \in \delta D$ and $K_{ij} = L_{ji}$ else. The
matrix $K$ is a stochastic matrix: its column vectors are probability vectors.
The matrix $K$ has a maximal eigenvalue 1 and so norm 1 ($K^T$ has the
maximal eigenvector $(1, 1, \dots, 1)$ with eigenvalue 1 and the eigenvalues of
$K$ agree with the eigenvalues of $K^T$). Because $||L|| < 1$, the spectral radius of
$L$ is smaller than 1 and the series converges. If $f = 1$ on the boundary,
then $u = 1$ everywhere. From $E_x[1] = 1$ follows that the discrete Wiener
measure is a probability measure on the set of all paths starting at $x$. □
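The series $u = \sum_n L^n F$ from the proof can be summed numerically on a small region. The sketch below is an illustration, with the region and the boundary data chosen here and not taken from the text: $D$ is the $5 \times 5$ square $\{0, \dots, 4\}^2$ in $\mathbb{Z}^2$, its boundary consists of the points on the four sides, and $f(x, y) = x$. Since $u(x, y) = x$ is discrete harmonic, the series should reproduce it at the interior points.

```python
def solve_dirichlet(points, boundary_values, n_terms=200):
    """Sum u = sum_n L^n F for the walk on Z^2 stopped at the boundary."""
    F = {p: boundary_values.get(p, 0.0) for p in points}

    def apply_L(u):
        out = {}
        for (x, y) in points:
            if (x, y) in boundary_values:
                out[(x, y)] = 0.0            # rows of boundary points vanish
            else:
                nbrs = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
                out[(x, y)] = sum(u[q] for q in nbrs) / 4.0   # L_xy = 1/(2d)
        return out

    term, total = dict(F), dict(F)
    for _ in range(n_terms):
        term = apply_L(term)                 # term = L^n F
        for p in points:
            total[p] += term[p]
    return total


N = 4
D = [(x, y) for x in range(N + 1) for y in range(N + 1)]
boundary = {p: float(p[0]) for p in D if p[0] in (0, N) or p[1] in (0, N)}
u = solve_dirichlet(D, boundary)
print(u[(1, 2)], u[(2, 2)], u[(3, 1)])
```

The number of terms is an arbitrary cut-off; the terms decay geometrically because the walk leaves the interior with positive probability in a bounded number of steps.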


Figure. The random walk defines a diffusion process.

Figure. The diffusion process after time $t = 2$.

Figure. The diffusion process after time $t = 3$.


The path integral result can be generalized and the increased generality 
makes it even simpler to describe: 


Definition. Let $(D, E)$ be an arbitrary finite directed graph, where $D$ is
a finite set of $n$ vertices and $E \subset D \times D$ is the set of edges. Denote an
edge connecting $i$ with $j$ with $e_{ij}$. Let $K$ be a stochastic matrix on $l^2(D)$:
the entries satisfy $K_{ij} \geq 0$ and its column vectors are probability vectors,
$\sum_{i \in D} K_{ij} = 1$ for all $j \in D$. The stochastic matrix encodes the graph and
additionally defines a random walk on $D$ if $K_{ij}$ is interpreted as the
transition probability to hop from $j$ to $i$. Let's call a point $j \in D$ a boundary
point, if $K_{jj} = 1$. The boundary points form $\delta D$ and the complement
$\mathrm{int} D = D \setminus \delta D$ consists of interior points.
Define the matrix $L$ as $L_{jj} = 0$ if $j$ is a boundary point and $L_{ij} = K_{ji}$
otherwise.

The discrete Wiener space $\Omega_x \subset D$ on $D$ is the set of all finite paths $\gamma =
(x = x_0, x_1, x_2, \dots, x_n)$ starting at a point $x \in D$ for which $K_{x_i x_{i+1}} > 0$.
The discrete Wiener measure on this countable set is defined as $P_x[\{\gamma\}] =
\prod_{j=0}^{n-1} K_{x_j, x_{j+1}}$. A function $u$ on $D$ is called harmonic if $(Lu)_x = 0$ for all
$x \in D$. The discrete Dirichlet problem on the graph is to find a function $u$
on $D$ which is harmonic and which satisfies $u = f$ on the boundary $\delta D$ of
$D$.


Theorem 3.13.3 (The Dirichlet problem on graphs). Assume $D$ is a directed
graph. If $S_n$ is the random walk starting at $x$ and $T$ is the stopping time
to reach the boundary of $D$, then the solution

$$ u = E_x[f(S_T)] $$

is the expected value of $f(S_T)$ on the discrete Wiener space of all paths
starting at $x$ and ending at the boundary of $D$.

Proof. Let $F$ be the function on $D$ which agrees with $f$ on the boundary of
$D$ and which is 0 in the interior of $D$. The Dirichlet problem on the graph
is the system of linear equations $(1 - L) u = f$, where $f$ is identified with
its extension $F$. Because the matrix $L$ has spectral radius smaller than 1,
the solution is given by the geometric series

$$ u = \sum_{n=0}^{\infty} L^n f . $$

But this is the sum $E_x[f(S_T)]$ over all paths $\gamma$ starting at $x$ and ending at
the boundary of $D$. □


Example. Let's look at a directed graph $(D, E)$ with 5 vertices and 2
boundary points. The Laplacian on $D$ is defined by the stochastic matrix

$$ K = \begin{pmatrix}
0   & 1/3 & 0 & 0 & 0 \\
1/2 & 0   & 1 & 0 & 0 \\
1/4 & 1/2 & 0 & 0 & 0 \\
1/8 & 1/6 & 0 & 1 & 0 \\
1/8 & 0   & 0 & 0 & 1
\end{pmatrix} $$

or the Laplacian

$$ L = \begin{pmatrix}
0   & 1/2 & 1/4 & 1/8 & 1/8 \\
1/3 & 0   & 1/2 & 1/6 & 0   \\
0   & 1   & 0   & 0   & 0   \\
0   & 0   & 0   & 1   & 0   \\
0   & 0   & 0   & 0   & 1
\end{pmatrix} . $$

Given a function $f$ on the boundary of $D$, the solution $u$ of the discrete
Dirichlet problem $(1 - L) u = f$ on this graph can be written as a path
integral $\sum_{n=0}^{\infty} L^n f = E_x[f(S_T)]$ for the random walk $S_n$ on $D$ stopped at
the boundary $\delta D$.
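For this 5-vertex example the Dirichlet problem can be solved exactly. The sketch below makes one assumption that should be stated clearly: the rows of $L$ belonging to the two boundary vertices are set to zero, as in the definition $L_{jj} = 0$ for boundary points, so that $1 - L$ is invertible. The linear system then says $u_i = \sum_j K_{ji} u_j$ at interior vertices and $u_i = f_i$ at boundary vertices. It is solved with exact fractions by Gauss-Jordan elimination; the function name `solve` is illustrative.

```python
from fractions import Fraction as Fr

# Column-stochastic matrix K of the example; K[i][j] = probability to hop j -> i.
K = [[0, Fr(1, 3), 0, 0, 0],
     [Fr(1, 2), 0, 1, 0, 0],
     [Fr(1, 4), Fr(1, 2), 0, 0, 0],
     [Fr(1, 8), Fr(1, 6), 0, 1, 0],
     [Fr(1, 8), 0, 0, 0, 1]]
n = len(K)
boundary = [j for j in range(n) if K[j][j] == 1]     # vertices 4 and 5 (1-based)


def solve(f):
    """Solve u_i = sum_j K[j][i] u_j (interior) and u_i = f[i] (boundary)."""
    rows = []
    for i in range(n):
        if i in boundary:
            rows.append([Fr(int(i == j)) for j in range(n)] + [Fr(f[i])])
        else:
            rows.append([Fr(int(i == j)) - K[j][i] for j in range(n)] + [Fr(0)])
    # Gauss-Jordan elimination on the augmented matrix, exact arithmetic.
    for col in range(n):
        piv = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[piv] = rows[piv], rows[col]
        rows[col] = [x / rows[col][col] for x in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[col])]
    return [rows[i][n] for i in range(n)]


u = solve({3: 1, 4: 0})     # f = 1 at vertex 4 and f = 0 at vertex 5
print(u)
```

With this choice of $f$, the value $u_i$ is the probability that the walk started at vertex $i$ is absorbed at vertex 4 and not at vertex 5.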


Figure. The directed graph 
(D,E) with 5 vertices and 2 
boundary points. 


Remark. The interplay of random walks on graphs and discrete partial 
differential equations is relevant in electric networks. For mathematical 
treatments, see [19, 99]. 


3.14 Markov processes 


Definition. Given a measurable space $(S, \mathcal{B})$ called state space, where $S$ is
a set and $\mathcal{B}$ is a $\sigma$-algebra on $S$. A function $P : S \times \mathcal{B} \to \mathbb{R}$ is called a
transition probability function if $P(x, \cdot)$ is a probability measure on $(S, \mathcal{B})$
for all $x \in S$ and if for every $B \in \mathcal{B}$, the map $s \mapsto P(s, B)$ is $\mathcal{B}$-measurable.
Define $P^1(x, B) = P(x, B)$ and inductively the measures $P^{n+1}(x, B) =
\int_S P^n(y, B) P(x, dy)$, where we write $\int P(x, dy)$ for the integration on $S$
with respect to the measure $P(x, \cdot)$.


Example. Let $S$ be a finite set and let $\mathcal{B}$ be the set of all subsets of $S$. Given
a stochastic matrix $K$ and a point $s \in S$, the measures $P(s, \cdot)$ are the
probability vectors, which are the columns of $K$.


Remark. The transition probability functions are elements in $L(S, M_1(S))$,
where $M_1(S)$ is the set of Borel probability measures on $S$. With the
multiplication

$$ (P \circ Q)(x, B) = \int_S P(y, B) \, Q(x, dy) $$

we get a commutative semi-group. The relation $P^{n+m} = P^n \circ P^m$ is also
called the Chapman-Kolmogorov equation.


Definition. Given a probability space $(\Omega, \mathcal{A}, P)$ with a filtration $\mathcal{A}_n$ of
$\sigma$-algebras. An $\mathcal{A}_n$-adapted process $X_n$ with values in $S$ is called a discrete
time Markov process if there exists a transition probability function $P$ such
that

$$ P[X_n \in B \mid \mathcal{A}_k](\omega) = P^{n-k}(X_k(\omega), B) . $$

Definition. If the state space $S$ is a discrete space, a finite or countable
set, then the Markov process is called a Markov chain. A Markov chain is
called a denumerable Markov chain, if the state space $S$ is countable, a
finite Markov chain, if the state space is finite.


Remark. It follows from the definition of a Markov process that $X_n$ satisfies
the elementary Markov property: for $n > k$,

$$ P[X_n \in B \mid X_1, \dots, X_k] = P[X_n \in B \mid X_k] . $$

This means that the probability distribution of $X_n$ is determined by knowing
the probability distribution of $X_{n-1}$. The future depends only on the
present and not on the past.


Theorem 3.14.1 (Markov processes exist). For any state space $(S, \mathcal{B})$ and
any transition probability function $P$, there exists a corresponding Markov
process $X$.

Proof. Choose a probability measure $\mu$ on $(S, \mathcal{B})$ and define on the product
space $(\Omega, \mathcal{A}) = (S^{\mathbb{N}}, \mathcal{B}^{\mathbb{N}})$ the $\pi$-system $\mathcal{C}$ consisting of cylinder-sets
$\prod_{n \in \mathbb{N}} B_n$ given by a sequence $B_n \in \mathcal{B}$ such that $B_n = S$ except for finitely
many $n$. Define a measure $P = P_\mu$ on $(\Omega, \mathcal{C})$ by requiring

$$ P[\omega_k \in B_k, \ k = 0, \dots, n] = \int_{B_0} \mu(dx_0) \int_{B_1} P(x_0, dx_1) \cdots \int_{B_n} P(x_{n-1}, dx_n) . $$

This measure has a unique extension to the $\sigma$-algebra $\mathcal{A}$.

Define the increasing sequence of $\sigma$-algebras $\mathcal{A}_n = \mathcal{B}^n \times \prod_{i=1}^{\infty} \{ \emptyset, \Omega \}$
containing cylinder sets. The random variables $X_n(\omega) = \omega_n$ are $\mathcal{A}_n$-adapted.
In order to see that it is a Markov process, we have to check that

$$ P[X_n \in B_n \mid \mathcal{A}_{n-1}](\omega) = P(X_{n-1}(\omega), B_n) $$

which is a special case of the above requirement by taking $B_k = S$ for
$k \neq n$. □


Example. Independent $S$-valued random variables.
Assume the measures $P(x, \cdot)$ are independent of $x$. Call this measure $P$. In
this case

$$ P[X_n \in B_n \mid \mathcal{A}_{n-1}](\omega) = P[B_n] $$

which means that $P[X_n \in B_n \mid \mathcal{A}_{n-1}] = P[X_n \in B_n]$. The $S$-valued
random variables $X_n$ are independent and have identical distribution and
$P$ is the law of $X_n$. Every sequence of IID random variables is a Markov
process.


Example. Countable and finite state Markov chains.
Given a Markov process with finite or countable state space $S$. We define
the transition matrix $P_{ij}$ on the Hilbert space $l^2(S)$ by

$$ P_{ij} = P(i, \{j\}) . $$

The matrix $P$ transports the law of $X_n$ into the law of $X_{n+1}$.
The transition matrix $P_{ij}$ is a stochastic matrix: with this indexing each row
is a probability vector: $\sum_j P_{ij} = 1$ with $P_{ij} \geq 0$. Every measure on $S$ can be
given by a vector $\pi \in l^2(S)$ and $P\pi$ is again a measure. If $X_0$ is constant and
equal to $i$ and $X_n$ is a Markov process with transition probability $P$, then
$P^n_{ij} = P[X_n = j]$.
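A small finite chain makes these statements concrete. The sketch below uses a made-up 3-state transition matrix, which is not from the text, with the indexing $P_{ij} = P(i, \{j\})$, so that rows sum to 1 and the law of $X_{n+1}$ is obtained from the law of $X_n$ by $\pi'_j = \sum_i \pi_i P_{ij}$. It checks the Chapman-Kolmogorov relation $P^{n+m} = P^n P^m$ and that the $i$-th row of $P^n$ is the law of $X_n$ when $X_0 = i$.

```python
from fractions import Fraction as Fr

# Hypothetical 3-state chain; P[i][j] = P(i, {j}), so each row sums to 1.
P = [[Fr(1, 2), Fr(1, 2), Fr(0)],
     [Fr(1, 4), Fr(1, 2), Fr(1, 4)],
     [Fr(0), Fr(1, 2), Fr(1, 2)]]


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def power(A, n):
    result = [[Fr(int(i == j)) for j in range(len(A))] for i in range(len(A))]
    for _ in range(n):
        result = matmul(result, A)
    return result


def push_forward(law, A):
    """Law of X_{n+1} from the law of X_n: new[j] = sum_i law[i] * A[i][j]."""
    return [sum(law[i] * A[i][j] for i in range(len(A))) for j in range(len(A))]


law = [Fr(1), Fr(0), Fr(0)]          # X_0 = state 0 with probability 1
for _ in range(3):
    law = push_forward(law, P)
print(law, power(P, 3)[0])
```

For this particular matrix the vector $(1/4, 1/2, 1/4)$ is left unchanged by `push_forward`, so it is an invariant law of the chain.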


Example. Sum of independent $S$-valued random variables. Let $S$ be a
countable Abelian group and let $\pi$ be a probability distribution on $S$ assigning
to each $j \in S$ the weight $\pi_j$. Define $P_{ij} = \pi_{j-i}$. Now $X_n$ is the sum of $n$
independent random variables with law $\pi$. The sum changes from $i$ to $j$
with probability $P_{ij} = \pi_{j-i}$.

Example. Branching processes. Given $S = \{ 0, 1, 2, \dots \} = \mathbb{N}$ with fixed
probability distribution $\pi$. If $X_k$ are independent $S$-valued random variables
with distribution $\pi$ then $\sum_{k=1}^{n} X_k$ has a distribution which we denote by
$\pi^{(n)}$. Define the matrix $P_{ij} = \pi^{(i)}_j$. The Markov chain with this transition
probability matrix on $S$ is called a branching process.
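The transition matrix of a branching process can be computed by repeated convolution. The sketch below assumes a made-up offspring distribution $\pi = (1/4, 1/2, 1/4)$ on $\{0, 1, 2\}$ and uses illustrative function names. Row $i$ of the matrix is $\pi^{(i)}$, the law of the total number of children of $i$ individuals; row 0 is the point mass at 0, so the state 0 (extinction) is absorbing.

```python
from fractions import Fraction as Fr


def convolve(p, q):
    """Law of the sum of two independent N-valued variables with laws p and q."""
    out = [Fr(0)] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def convolution_power(pi, n):
    """Law pi^(n) of X_1 + ... + X_n; pi^(0) is the point mass at 0."""
    law = [Fr(1)]
    for _ in range(n):
        law = convolve(law, pi)
    return law


def transition_row(pi, i, size):
    """Row i of P_ij = pi^(i)_j, truncated or zero-padded to `size` columns."""
    law = convolution_power(pi, i)
    return (law + [Fr(0)] * size)[:size]


pi = [Fr(1, 4), Fr(1, 2), Fr(1, 4)]    # hypothetical offspring law on {0, 1, 2}
for i in range(3):
    print(i, transition_row(pi, i, 5))
```

Since the state space is infinite, only a finite corner of the matrix is shown; a row is complete whenever `size` exceeds the largest possible number of children.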


Definition. The transition probability function $P$ acts also on measures $\pi$
of $S$ by

$$ P(\pi)(B) = \int_S P(x, B) \, d\pi(x) . $$

A probability measure $\pi$ is called invariant if $P\pi = \pi$. An invariant measure
$\pi$ on $S$ is called stationary measure of the Markov process.


This operator on measures leaves a subclass of measures with densities with
respect to some measure $\nu$ invariant. We can so assign a Markov operator
to a transition probability function:

Lemma 3.14.2. For any $x \in S$ define the measure

$$ \nu(B) = \sum_{n=0}^{\infty} \frac{1}{2^n} P^n(x, B) $$

on $(S, \mathcal{B})$. It has the property that if $\mu$ is absolutely continuous with respect
to $\nu$, then also $P\mu$ is absolutely continuous with respect to $\nu$.

Proof. Given $\mu = f \cdot \nu$ with $f \in L^1(S)$. Then

$$ P\mu(B) = \int_S P(x, B) f(x) \, d\nu(x) $$

is absolutely continuous with respect to $\nu$ because $\nu(B) = 0$ implies
$P(x, B) = 0$ for almost all $x$ with $f(x) > 0$ and so $P\mu(B) = 0$. □


Corollary 3.14.3. To each transition probability function can be assigned a
Markov operator $P : L^1(S, \nu) \to L^1(S, \nu)$.

Proof. Choose $\nu$ as above and define

$$ P f_1 = f_2 $$

if $P\mu_1 = \mu_2$ with $\mu_i = f_i \nu$. To check that $P$ is a Markov operator, we have
to check $Pf \geq 0$ if $f \geq 0$, which follows from

$$ Pf\nu(B) = \int_S P(x, B) f(x) \, d\nu(x) \geq 0 . $$

We also have to show that $||Pf||_1 = 1$ if $||f||_1 = 1$. It is enough to show this
for elementary functions $f = \sum_j a_j 1_{B_j}$ with $a_j > 0$ and $B_j \in \mathcal{B}$ satisfying
$\sum_j a_j \nu(B_j) = 1$, for which it follows from $||P 1_B \nu|| = \nu(B)$. But this is
obvious: $||P 1_B \nu|| = \int_B P(x, S) \, d\nu(x) = \nu(B)$. □

We see that the abstract approach to study Markov operators on $L^1(S)$ is
more general than looking at transition probability measures. This point
of view can reduce some of the complexity, when dealing with discrete time
Markov processes.

Chapter 4


Continuous Stochastic Processes


4.1 Brownian motion 


Definition. Let $(\Omega, \mathcal{A}, P)$ be a probability space and let $T \subset \mathbb{R}$ be time.
A collection of random variables $X_t$, $t \in T$ with values in $\mathbb{R}$ is called a
stochastic process. If $X_t$ takes values in $S = \mathbb{R}^d$, it is called a vector-valued
stochastic process but one often abbreviates this by the name stochastic
process too. If the time $T$ is a discrete subset of $\mathbb{R}$, then $X_t$ is called
a discrete time stochastic process. If time is an interval, $\mathbb{R}^+$ or $\mathbb{R}$, it is
called a stochastic process with continuous time. For any fixed $\omega \in \Omega$, one
can regard $X_t(\omega)$ as a function of $t$. It is called a sample function of the
stochastic process. In the case of a vector-valued process, it is a sample
path, a curve in $\mathbb{R}^d$.

Definition. A stochastic process is called measurable, if $X : T \times \Omega \to S$ is
measurable with respect to the product $\sigma$-algebra $\mathcal{B}(T) \times \mathcal{A}$. In the case of
a real-valued process ($S = \mathbb{R}$), one says $X$ is continuous in probability if
for any $t \in \mathbb{R}$ the limit $X_{t+h} \to X_t$ takes place in probability for $h \to 0$.
If the sample function $X_t(\omega)$ is a continuous function of $t$ for almost all $\omega$,
then $X_t$ is called a continuous stochastic process. If the sample function is
a right continuous function in $t$ for almost all $\omega \in \Omega$, $X_t$ is called a right
continuous stochastic process. Two stochastic processes $X_t$ and $Y_t$ satisfying
$P[X_t - Y_t = 0] = 1$ for all $t \in T$ are called modifications of each other
or indistinguishable. This means that for almost all $\omega \in \Omega$, the sample
functions coincide: $X_t(\omega) = Y_t(\omega)$.


Definition. An $\mathbb{R}^n$-valued random vector $X$ is called Gaussian, if it has the
multidimensional characteristic function

$$ \phi_X(s) = E[e^{i s \cdot X}] = e^{-(s, Vs)/2 + i(m, s)} $$

for some nonsingular symmetric $n \times n$ matrix $V$ and vector $m = E[X]$. The
matrix $V$ is called covariance matrix and the vector $m$ is called the mean
vector.


Example. A normal distributed random variable $X$ is a Gaussian random
variable. The covariance matrix is in this case the scalar $\mathrm{Var}[X]$.

Example. If $V$ is a symmetric matrix with determinant $\det(V) \neq 0$, then
the random variable $X$ with density

$$ f_X(x) = \frac{1}{(2\pi)^{n/2} \sqrt{\det(V)}} \, e^{-(x - m, V^{-1}(x - m))/2} $$

on $\Omega = \mathbb{R}^n$ is a Gaussian random variable with covariance matrix $V$. To
see that it has the required multidimensional characteristic function $\phi_X(u)$,
note that because $V$ is symmetric, one can diagonalize it. Therefore, the
computation can be done in a basis, where $V$ is diagonal. This reduces the
situation to characteristic functions for normal random variables.


Example. A set of random variables $X_1, \dots, X_n$ are called jointly Gaussian
if any linear combination $\sum_{i=1}^{n} a_i X_i$ is a Gaussian random variable too.
For a jointly Gaussian set of random variables $X_j$, the vector $X =
(X_1, \dots, X_n)$ is a Gaussian random vector.

Example. A Gaussian process is an $\mathbb{R}^d$-valued stochastic process with
continuous time such that $(X_{t_0}, X_{t_1}, \dots, X_{t_n})$ is jointly Gaussian for any $t_0 <
t_1 < \dots < t_n$. It is called centered if $m_t = E[X_t] = 0$ for all $t$.


Definition. An $\mathbb{R}^d$-valued continuous Gaussian process $X_t$ with mean vector
$m_t = E[X_t]$ and the covariance matrix $V(s, t) = \mathrm{Cov}[X_s, X_t] = E[(X_s -
m_s) \cdot (X_t - m_t)^*]$ is called Brownian motion if for any $0 < t_0 < t_1 < \dots < t_n$,
the random vectors $X_{t_0}, X_{t_{i+1}} - X_{t_i}$ are independent and the covariance
matrix $V$ satisfies $V(s, t) = V(r, r)$, where $r = \min(s, t)$ and $s \mapsto V(s, s)$.
It is called the standard Brownian motion if $m_t = 0$ for all $t$ and $V(s, t) =
\min\{s, t\}$.

Figure. A path $X_t(\omega_1)$ of Brownian motion in the plane $S = \mathbb{R}^2$ with a drift
$m_t = E[X_t] = (t, 0)$. This is not standard Brownian motion. The process
$Y_t = X_t - (t, 0)$ is standard Brownian motion.

Recall that for two random vectors $X, Y$ with mean vectors $m, n$, the
covariance matrix is $\mathrm{Cov}[X, Y]_{ij} = E[(X_i - m_i)(Y_j - n_j)]$. We say $\mathrm{Cov}[X, Y] = 0$
if this matrix is the zero matrix.


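Lemma 4.1.1. A Gaussian random vector $(X, Y)$ with random vectors $X, Y$
satisfying $\mathrm{Cov}[X, Y] = 0$ has the property that $X$ and $Y$ are independent.

Proof. We can assume without loss of generality that the random variables
$X, Y$ are centered. Two $\mathbb{R}^n$-valued Gaussian random vectors $X$ and $Y$ are
independent if and only if

$$ \phi_{(X,Y)}(s, t) = \phi_X(s) \cdot \phi_Y(t) , \qquad \forall s, t \in \mathbb{R}^n . $$

Indeed, if $V$ is the covariance matrix of the random vector $X$ and $W$ is the
covariance matrix of the random vector $Y$, then

$$ U = \begin{pmatrix} V & \mathrm{Cov}[X, Y] \\ \mathrm{Cov}[Y, X] & W \end{pmatrix}
     = \begin{pmatrix} V & 0 \\ 0 & W \end{pmatrix} $$

is the covariance matrix of the random vector $(X, Y)$. With $r = (s, t)$, we
have therefore

$$ \begin{aligned}
\phi_{(X,Y)}(r) &= E[e^{i r \cdot (X, Y)}] = e^{-\frac{1}{2} (r \cdot U r)} \\
 &= e^{-\frac{1}{2} (s \cdot V s) - \frac{1}{2} (t \cdot W t)} \\
 &= e^{-\frac{1}{2} (s \cdot V s)} \, e^{-\frac{1}{2} (t \cdot W t)} \\
 &= \phi_X(s) \, \phi_Y(t) .
\end{aligned} $$

□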


Example. In the context of this lemma, one should mention that there
exist uncorrelated normal distributed random variables $X, Y$ which are not
independent [109]: Proof. Let $X$ be Gaussian on $\mathbb{R}$ and define for $a > 0$ the
variable $Y(\omega) = -X(\omega)$ if $|X(\omega)| > a$ and $Y = X$ else. Also $Y$ is Gaussian and
there exists $a$ such that $E[XY] = 0$. But $X$ and $Y$ are not independent and
$X + Y = 0$ on the set $\{ |X| > a \}$, which has probability strictly between 0
and 1, shows that $X + Y$ is not Gaussian. This example shows
why Gaussian vectors $(X, Y)$ are defined directly as $\mathbb{R}^2$-valued random
variables with some properties and not as a vector $(X, Y)$ where each of
the two components is a one-dimensional random Gaussian variable.


Proposition 4.1.2. If $X_t$ is a Gaussian process with covariance $V(s,t) = V(r,r)$ with $r = \min(s,t)$, then it is Brownian motion.

Proof. By the above lemma (4.1.1), we only have to check that for all $i < j$

$$ \mathrm{Cov}[X_{t_0}, X_{t_{j+1}} - X_{t_j}] = 0, \quad \mathrm{Cov}[X_{t_{i+1}} - X_{t_i}, X_{t_{j+1}} - X_{t_j}] = 0 . $$

But by assumption

$$ \mathrm{Cov}[X_{t_0}, X_{t_{j+1}} - X_{t_j}] = V(t_0, t_{j+1}) - V(t_0, t_j) = V(t_0, t_0) - V(t_0, t_0) = 0 $$

and

$$ \mathrm{Cov}[X_{t_{i+1}} - X_{t_i}, X_{t_{j+1}} - X_{t_j}] = V(t_{i+1}, t_{j+1}) - V(t_{i+1}, t_j) - V(t_i, t_{j+1}) + V(t_i, t_j) $$
$$ = V(t_{i+1}, t_{i+1}) - V(t_{i+1}, t_{i+1}) - V(t_i, t_i) + V(t_i, t_i) = 0 . $$

□


Remark. Botanist Robert Brown was studying the fertilization process in a 
species of flowers in 1828. While watching pollen particles in water through 
a microscope, he observed small particles in "rapid oscillatory motion".
While previous studies concluded that these particles were alive, Brown's
explanation was that matter is composed of small "active molecules", which
exhibit a rapid, irregular motion having its origin in the particles themselves 
and not in the surrounding fluid. Brown’s contribution was to establish 
Brownian motion as an important phenomenon, to demonstrate its presence 
in inorganic as well as organic matter and to refute by experiment incorrect 
mechanical or biological explanations of the phenomenon. The book [73] 
includes more on the history of Brownian motion. 


The construction of Brownian motion happens in two steps: one first con- 
structs a Gaussian process which has the desired properties and then shows 
that it has a modification which is continuous. 


Proposition 4.1.3. Given a separable real Hilbert space $(H, \| \cdot \|)$. There exists a probability space $(\Omega, \mathcal{A}, P)$ and a family $X(h), h \in H$ of real-valued random variables on $\Omega$ such that $h \mapsto X(h)$ is linear, and $X(h)$ is Gaussian, centered and $\mathrm{E}[X(h)^2] = \|h\|^2$.

Proof. Pick an orthonormal basis $\{e_n\}$ in $H$ and attach to each $e_n$ a centered Gaussian IID random variable $X_n \in L^2$ satisfying $\|X_n\|_2 = 1$. Given a general $h = \sum_n h_n e_n \in H$, define

$$ X(h) = \sum_n h_n X_n $$

which converges in $L^2$. Because the $X_n$ are independent, they are orthonormal in $L^2$ so that

$$ \|X(h)\|_2^2 = \sum_n h_n^2 \|X_n\|_2^2 = \sum_n h_n^2 = \|h\|^2 . $$

□


Definition. If we choose $H = L^2(\mathbb{R}^+, dx)$, the map $X : H \to L^2$ is also called a Gaussian measure. For a Borel set $A \subset \mathbb{R}^+$ we define then $X(A) = X(1_A)$. The term "measure" is warranted by the fact that $X(A) = \sum_n X(A_n)$ if $A$ is a countable disjoint union of Borel sets $A_n$. One also has $X(\emptyset) = 0$.


Remark. The space $X(H) \subset L^2$ is a Hilbert space isomorphic to $H$ and in particular

$$ \mathrm{E}[X(h) X(h')] = (h, h') . $$

We know from the above lemma that $h$ and $h'$ are orthogonal if and only if $X(h)$ and $X(h')$ are independent and that

$$ \mathrm{E}[X(A) X(B)] = \mathrm{Cov}[X(A), X(B)] = (1_A, 1_B) = |A \cap B| . $$

Especially $X(A)$ and $X(B)$ are independent if and only if $A$ and $B$ are disjoint.
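The identity $\mathrm{E}[X(h)X(h')] = (h,h')$ can be illustrated numerically. The following is only a sketch under simplifying assumptions that are not in the text: $H$ is replaced by $\mathbb{R}^3$ with its standard basis, and the helper name `gaussian_family`, the seed and the sample size are arbitrary illustrative choices.

```python
import numpy as np

def gaussian_family(coeffs, n_samples, rng):
    """Sample X(h) = sum_n h_n X_n for each row h of `coeffs`.

    coeffs has shape (m, d): m vectors h written in an orthonormal basis e_1..e_d.
    Returns an array of shape (n_samples, m); column j holds samples of X(h_j).
    """
    xi = rng.standard_normal((n_samples, coeffs.shape[1]))  # IID N(0,1) variables X_n
    return xi @ coeffs.T

rng = np.random.default_rng(0)                 # fixed seed, illustrative only
h = np.array([[1.0, 2.0, 0.0],
              [0.5, -1.0, 3.0]])               # two hypothetical vectors in H = R^3
samples = gaussian_family(h, 200_000, rng)
empirical = samples.T @ samples / len(samples)  # Monte Carlo estimate of E[X(h) X(h')]
gram = h @ h.T                                  # inner products (h, h')
```

With this many samples the matrix `empirical` should agree with `gram` up to Monte Carlo error of a few hundredths.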


Definition. Define the process $B_t = X([0,t])$. For any sequence $t_1, t_2, \dots \in T$, this process has independent increments $B_{t_i} - B_{t_{i-1}}$ and is a Gaussian process. For each $t$, we have $\mathrm{E}[B_t^2] = t$ and for $s < t$, the increment $B_t - B_s$ has variance $t - s$ so that

$$ \mathrm{E}[B_s B_t] = \mathrm{E}[B_s^2] + \mathrm{E}[B_s (B_t - B_s)] = \mathrm{E}[B_s^2] = s . $$

This model of Brownian motion has everything except continuity.
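A discrete analogue of $B_t = X([0,t])$ makes the covariance computation concrete. The sketch below assumes a grid of `n_cells` equal cells of width `dx`, each carrying an independent $N(0, dx)$ variable; the function name and the grid are illustrative assumptions, not part of the construction in the text. The covariance of the partial sums is computed exactly by linear algebra rather than by sampling.

```python
import numpy as np

def brownian_covariance_from_cells(n_cells, t_max):
    """Exact covariance of B_{t_k} = sum of the first k independent N(0, dx) cells."""
    dx = t_max / n_cells
    # Row k of L is the indicator of the cells contained in [0, t_k], t_k = (k + 1) * dx.
    L = np.tril(np.ones((n_cells, n_cells)))
    times = dx * np.arange(1, n_cells + 1)
    cov = dx * (L @ L.T)          # Cov[B_s, B_t] = length of [0, s] intersected with [0, t]
    return times, cov

times, cov = brownian_covariance_from_cells(8, 2.0)
expected = np.minimum.outer(times, times)   # min(s, t) on the grid
```

On the grid, `cov` coincides with `expected`, which is the discrete form of $\mathrm{E}[B_s B_t] = \min\{s,t\}$.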


Theorem 4.1.4 (Kolmogorov's lemma). Given a stochastic process $X_t$ with $t \in [a,b]$ for which there exist three constants $p > r$, $K$ such that

$$ \mathrm{E}[|X_{t+h} - X_t|^p] \le K \cdot h^{1+r} $$

for every $t, t+h \in [a,b]$, then $X_t$ has a modification $Y_t$ which is almost everywhere continuous: for all $s, t \in [a,b]$

$$ |Y_t(\omega) - Y_s(\omega)| \le C(\omega) |t - s|^{\alpha}, \quad 0 < \alpha < \frac{r}{p} . $$


Proof. We can assume without loss of generality that $a = 0, b = 1$ because we can translate and rescale the time variable to be in this situation. Define $\epsilon = r - \alpha p$. By the Chebychev-Markov inequality (2.5.4)

$$ P[|X_{t+h} - X_t| \ge |h|^{\alpha}] \le |h|^{-\alpha p} \mathrm{E}[|X_{t+h} - X_t|^p] \le K |h|^{1+\epsilon} $$

so that

$$ P[|X_{(k+1)/2^n} - X_{k/2^n}| \ge 2^{-n\alpha}] \le K 2^{-n(1+\epsilon)} . $$

Therefore

$$ \sum_{n=1}^{\infty} \sum_{k=0}^{2^n - 1} P[|X_{(k+1)/2^n} - X_{k/2^n}| \ge 2^{-n\alpha}] < \infty . $$

By the first Borel-Cantelli lemma (2.2.2), there exists $n(\omega) < \infty$ almost everywhere such that for all $n \ge n(\omega)$ and $k = 0, \dots, 2^n - 1$

$$ |X_{(k+1)/2^n}(\omega) - X_{k/2^n}(\omega)| < 2^{-n\alpha} . $$

Let $n \ge n(\omega)$ and $t \in [k/2^n, (k+1)/2^n]$ of the form $t = k/2^n + \sum_{i=1}^{m} \gamma_i / 2^{n+i}$ with $\gamma_i \in \{0,1\}$. Then

$$ |X_t(\omega) - X_{k2^{-n}}(\omega)| \le \sum_{i=1}^{m} \gamma_i 2^{-\alpha(n+i)} \le d \, 2^{-n\alpha} $$

with $d = (1 - 2^{-\alpha})^{-1}$. Similarly

$$ |X_t - X_{(k+1)2^{-n}}| \le d \, 2^{-n\alpha} . $$

Given $t, t+h \in D = \{ k 2^{-n} \mid n \in \mathbb{N}, k = 0, \dots, 2^n - 1 \}$. Take $n$ so that $2^{-n-1} \le h < 2^{-n}$ and $k$ so that $k/2^{n+1} \le t < (k+1)/2^{n+1}$. Then $(k+1)/2^{n+1} \le t + h \le (k+3)/2^{n+1}$ and

$$ |X_{t+h} - X_t| \le 2d \, 2^{-(n+1)\alpha} \le 2d \, h^{\alpha} . $$

For almost all $\omega$, this holds for sufficiently small $h$.


We know now that for almost all $\omega$, the path $X_t(\omega)$ is uniformly continuous on the dense set of dyadic numbers $D = \{k/2^n\}$. Such a function can be extended to a continuous function on $[0,1]$ by defining

$$ Y_t(\omega) = \lim_{s \in D \to t} X_s(\omega) . $$

Because the inequality in the assumption of the theorem implies $\mathrm{E}[X_t(\omega) - \lim_{s \in D \to t} X_s(\omega)] = 0$ and by Fatou's lemma $\mathrm{E}[Y_t(\omega) - \lim_{s \in D \to t} X_s(\omega)] = 0$, we know that $X_t = Y_t$ almost everywhere. The process $Y$ is therefore a modification of $X$. Moreover, $Y$ satisfies

$$ |Y_t(\omega) - Y_s(\omega)| \le C(\omega) |t - s|^{\alpha} $$

for all $s, t \in [a,b]$. □


Corollary 4.1.5. Brownian motion exists.

Proof. In one dimension, take the process $B_t$ from above. Since $X_h = B_{t+h} - B_t$ is centered with variance $h$, the fourth moment is $\mathrm{E}[X_h^4] = \frac{d^4}{dx^4} \exp(-x^2 h/2) |_{x=0} = 3h^2$, so that

$$ \mathrm{E}[(B_{t+h} - B_t)^4] = 3h^2 . $$

Kolmogorov's lemma (4.1.4) assures the existence of a continuous modification of $B$.

To define standard Brownian motion in $n$ dimensions, we take the joint motion $B_t = (B_t^{(1)}, \dots, B_t^{(n)})$ of $n$ independent one-dimensional Brownian motions. □
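As a worked check of the fourth moment and of how the constants in Kolmogorov's lemma are matched, one can expand the characteristic function of a centered Gaussian variable with variance $h$. The identification $p = 4$, $r = 1$, $K = 3$ below is our reading of the hypotheses of Theorem 4.1.4, stated here as an assumption.

```latex
% Taylor expansion of the characteristic function of N(0, h):
e^{-x^2 h/2} = 1 - \frac{h}{2} x^2 + \frac{h^2}{8} x^4 - \dots
\qquad\Longrightarrow\qquad
\frac{d^4}{dx^4} e^{-x^2 h/2} \Big|_{x=0} = 4! \cdot \frac{h^2}{8} = 3 h^2 .

% Matching E[|X_{t+h} - X_t|^p] \le K h^{1+r}:
\mathrm{E}[(B_{t+h} - B_t)^4] = 3 h^{2}
\quad\text{corresponds to}\quad p = 4,\; K = 3,\; 1 + r = 2,
\quad\text{so}\quad 0 < \alpha < \frac{r}{p} = \frac{1}{4} .
```

With the fourth moment alone, the lemma therefore yields Hölder exponents below $1/4$; higher moments improve this bound.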


Definition. Let $B_t$ be the standard Brownian motion. For any $x \in \mathbb{R}^n$, the process $X_t^x = x + B_t$ is called Brownian motion started at $x$.


The first rigorous construction of Brownian motion was given by Norbert
Wiener in 1923. By construction of a Wiener measure on $C[0,1]$, one has
a construction of Brownian motion, where the probability space is directly
given by the set of paths. One has then the process $X_t(\omega) = \omega(t)$. We will
come to this later. A general construction of such measures is possible given
a Markov transition probability function [104]. The construction given here
is due to Neveu and goes back to Kakutani. It can be found in Simon’s book 
on functional integration [93] or in the book of Revuz and Yor [83] about 
continuous martingales and Brownian motion. This construction has the 
advantage that it can be applied to more general situations. 


In McKean's book "Stochastic integrals" [66] one can find Lévy's direct
proof of the existence of Brownian motion. Because that proof gives an
explicit formula for the Brownian motion process $B_t$ and is so constructive,
we outline it briefly:


1) Take as a basis in $L^2([0,1])$ the Haar functions

$$ f_{k,n} = 2^{(n-1)/2} \left( 1_{[(k-1)2^{-n}, \, k2^{-n}]} - 1_{[k2^{-n}, \, (k+1)2^{-n}]} \right) $$

for $\{ (k,n) \mid n \ge 1, k < 2^n \}$ and $f_{0,0} = 1$.

2) Take a family $X_{k,n}$ for $(k,n) \in I = \{ (k,n) \mid n \ge 1, k < 2^n, k \text{ odd} \} \cup \{ (0,0) \}$ of independent Gaussian random variables.

3) Define

$$ B_t = \sum_{(k,n) \in I} X_{k,n} \int_0^t f_{k,n} . $$

4) Prove convergence of the above series.

5) Check

$$ \mathrm{E}[B_s B_t] = \sum_{(k,n) \in I} \int_0^s f_{k,n} \int_0^t f_{k,n} = \int_0^1 1_{[0,s]} 1_{[0,t]} = \inf\{s, t\} . $$

6) Extend the definition from $t \in [0,1]$ to $t \in [0,\infty)$ by taking independent Brownian motions $B^{(n)}$ on $[0,1]$ and defining $B_t = \sum_{n < [t]} B^{(n)}_1 + B^{([t])}_{t - [t]}$, where $[t]$ is the largest integer smaller or equal to $t$.
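Steps 1) to 5) can be sketched in code for a truncated index set. This is a minimal illustration, not the construction of [66] itself: the truncation depth `levels`, the function names, and the choice of standard normal coefficients are assumptions made here. The integral of the Haar function $f_{k,n}$ over $[0,t]$ is a tent function of height $2^{-(n+1)/2}$ centered at $k2^{-n}$.

```python
import numpy as np

def schauder(k, n, t):
    """Integral over [0, t] of the Haar function f_{k,n}; f_{0,0} = 1 integrates to t."""
    t = np.asarray(t, dtype=float)
    if n == 0:
        return t.copy()
    peak = k * 2.0 ** -n
    # Tent with slope 2^{(n-1)/2}, supported on [(k-1) 2^-n, (k+1) 2^-n].
    return 2.0 ** ((n - 1) / 2) * np.maximum(0.0, 2.0 ** -n - np.abs(t - peak))

def index_set(levels):
    """Indices (0,0) and (k,n) with 1 <= n <= levels, k odd, k < 2**n (step 2, truncated)."""
    return [(0, 0)] + [(k, n) for n in range(1, levels + 1) for k in range(1, 2 ** n, 2)]

def levy_path(t, levels, rng):
    """Partial sum of the series in step 3 with independent N(0,1) coefficients."""
    return sum(rng.standard_normal() * schauder(k, n, t) for k, n in index_set(levels))

def truncated_covariance(s, t, levels):
    """Truncated version of the sum in step 5."""
    return sum(schauder(k, n, s) * schauder(k, n, t) for k, n in index_set(levels))

grid = np.arange(9) / 8.0                                   # dyadic points of level 3
cov = truncated_covariance(grid[:, None], grid[None, :], levels=3)
path = levy_path(grid, levels=3, rng=np.random.default_rng(1))
```

At dyadic points of level at most `levels`, the truncated sum already equals $\min\{s,t\}$ exactly, because the partial sum interpolates the limiting process linearly between those points.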


4.2 Some properties of Brownian motion 


We first want to establish that Brownian motion is unique. To do so, we 
first have to say, when two processes are the same: 


Definition. Two processes $X_t$ on $(\Omega, \mathcal{A}, P)$ and $X'_t$ on $(\Omega', \mathcal{A}', P')$ are called indistinguishable, if there exists an isomorphism $U : \Omega \to \Omega'$ of probability spaces, such that $X'_t(U\omega) = X_t(\omega)$. Indistinguishable processes are considered the same. A special case is if the two processes are defined on the same probability space $(\Omega, \mathcal{A}, P)$ and $X_t(\omega) = Y_t(\omega)$ for almost all $\omega$.


Proposition 4.2.1. Brownian motion is unique in the sense that two standard Brownian motions are indistinguishable.

Proof. The construction of the map $H \to L^2$ was unique in the sense that if we construct two different processes $X(h)$ and $Y(h)$, then there exists an isomorphism $U$ of the probability space such that $X(h) = Y(U(h))$. The continuity of $X_t$ and $Y_t$ implies then that for almost all $\omega$, $X_t(\omega) = Y_t(U\omega)$. In other words, they are indistinguishable. □


We are now ready to list some symmetries of Brownian motion. 


Theorem 4.2.2 (Properties of Brownian motion). The following symmetries exist:

(i) Time-homogeneity: For any $s > 0$, the process $\tilde{B}_t = B_{t+s} - B_s$ is a Brownian motion independent of $\sigma(B_u, u \le s)$.

(ii) Reflection symmetry: The process $\tilde{B}_t = -B_t$ is a Brownian motion.

(iii) Brownian scaling: For every $c > 0$, the process $\tilde{B}_t = c B_{t/c^2}$ is a Brownian motion.

(iv) Time inversion: The process $\tilde{B}_0 = 0$, $\tilde{B}_t = t B_{1/t}$, $t > 0$ is a Brownian motion.


Proof. (i),(ii),(iii) In each case, the new process $\tilde{B}_t$ is a centered Gaussian process with continuous paths, independent increments and variance $t$.

(iv) The process $\tilde{B}_t = t B_{1/t}$ is a centered Gaussian process with covariance

$$ \mathrm{Cov}[\tilde{B}_s, \tilde{B}_t] = \mathrm{E}[\tilde{B}_s \tilde{B}_t] = st \cdot \mathrm{E}[B_{1/s} B_{1/t}] = st \cdot \inf(\frac{1}{s}, \frac{1}{t}) = \inf(s,t) . $$

Continuity of $\tilde{B}_t$ is obvious for $t > 0$. We have to check continuity only for $t = 0$, but since $\mathrm{E}[\tilde{B}_s^2] = s \to 0$ for $s \to 0$, we know that $\tilde{B}_s \to 0$ almost everywhere. □
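The covariance identities behind Brownian scaling and time inversion can be checked mechanically from $\mathrm{E}[B_s B_t] = \min\{s,t\}$. The helper names below are hypothetical and the sample times are arbitrary; this is a small sanity check under those assumptions, not a proof.

```python
import numpy as np

def bm_cov(s, t):
    """Covariance of standard Brownian motion, E[B_s B_t] = min(s, t)."""
    return np.minimum(s, t)

def scaled_cov(s, t, c):
    """Covariance of the scaled process c * B_{t / c^2}."""
    return c ** 2 * bm_cov(s / c ** 2, t / c ** 2)

def inverted_cov(s, t):
    """Covariance of the time-inverted process t * B_{1/t}, for s, t > 0."""
    return s * t * bm_cov(1.0 / s, 1.0 / t)

s = np.array([0.5, 1.0, 2.0])[:, None]
t = np.array([0.25, 1.0, 3.0])[None, :]
```

Both `scaled_cov(s, t, c)` for any `c > 0` and `inverted_cov(s, t)` return the same array as `bm_cov(s, t)`.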


The strong law of large numbers for Brownian motion follows:

Theorem 4.2.3 (SLLN for Brownian motion). If $B_t$ is Brownian motion, then

$$ \lim_{t \to \infty} \frac{1}{t} B_t = 0 $$

almost surely.

Proof. From the time inversion property (iv), we see that $t^{-1} B_t = \tilde{B}_{1/t}$, where $\tilde{B}_t = t B_{1/t}$ is the time-inverted Brownian motion. This converges for $t \to \infty$ to $0$ almost everywhere, because of the almost everywhere continuity of $\tilde{B}_t$ at $t = 0$. □


Definition. A parameterized curve $t \in [0,\infty) \mapsto X_t \in \mathbb{R}^n$ is called Hölder continuous of order $\alpha$ if there exists a constant $C$ such that

$$ \|X_{t+h} - X_t\| \le C \cdot h^{\alpha} $$

for all $h > 0$ and all $t$. A curve which is Hölder continuous of order $\alpha = 1$ is called Lipschitz continuous.

The curve is called locally Hölder continuous of order $\alpha$ if there exists for each $t$ a constant $C = C(t)$ such that

$$ \|X_{t+h} - X_t\| \le C \cdot h^{\alpha} $$

for all small enough $h$. For an $\mathbb{R}^d$-valued stochastic process, (local) Hölder continuity holds if for almost all $\omega \in \Omega$ the sample path $X_t(\omega)$ is (locally) Hölder continuous.


Proposition 4.2.4. For every $\alpha < 1/2$, Brownian motion has a modification which is locally Hölder continuous of order $\alpha$.

Proof. It is enough to show it in one dimension because a vector function with locally Hölder continuous component functions is locally Hölder continuous. Since increments of Brownian motion are Gaussian, we have

$$ \mathrm{E}[(B_t - B_s)^{2p}] = C_p \cdot |t - s|^p $$

for some constant $C_p$. Kolmogorov's lemma assures the existence of a modification satisfying locally

$$ |B_t - B_s| \le C |t - s|^{\alpha}, \quad 0 < \alpha < \frac{p-1}{2p} . $$

Because $p$ can be chosen arbitrarily large, the result follows. □
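The exponent bound can be spelled out by matching the moment identity with the hypothesis of Kolmogorov's lemma. The value $C_p = (2p-1)!!$ is the standard even moment of a standard normal variable; it is not stated in the text and is added here for concreteness.

```latex
% Moments of a centered Gaussian increment with variance |t - s|:
\mathrm{E}[(B_t - B_s)^{2p}] = (2p-1)!! \, |t-s|^{p}, \qquad (2p-1)!! = 1 \cdot 3 \cdot 5 \cdots (2p-1).

% Kolmogorov's lemma with moment exponent 2p and 1 + r = p:
r = p - 1, \qquad 0 < \alpha < \frac{r}{2p} = \frac{p-1}{2p} = \frac{1}{2} - \frac{1}{2p}
\;\longrightarrow\; \frac{1}{2} \quad (p \to \infty).
```

For example, $p = 2$ gives $\alpha < 1/4$ and $p = 10$ gives $\alpha < 9/20$.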


Because of this proposition, we can assume from now on that all the paths of Brownian motion are locally Hölder continuous of order $\alpha < 1/2$.

Definition. A continuous path $X_t = (x_t^{(1)}, \dots, x_t^{(n)})$ is called nowhere differentiable, if for all $t$, each coordinate function $x_t^{(i)}$ is not differentiable at $t$.


Theorem 4.2.5 (Wiener). Brownian motion is nowhere differentiable: for almost all $\omega$, the path $t \mapsto X_t(\omega)$ is nowhere differentiable.


Proof. We follow [66]. It is enough to show it in one dimension. Suppose $B_t$ is differentiable at some point $0 \le s \le 1$. There exists then an integer $l$ such that $|B_t - B_s| \le l(t - s)$ for $t - s > 0$ small enough. But this means that

$$ |B_{j/n} - B_{(j-1)/n}| \le 7 \frac{l}{n} $$

for all $j$ satisfying

$$ i = [ns] + 1 \le j \le [ns] + 4 = i + 3 $$

and sufficiently large $n$, so that the set of differentiable paths is included in the set

$$ B = \bigcup_{l \ge 1} \bigcup_{m \ge 1} \bigcap_{n \ge m} \bigcup_{0 < i \le n+1} \bigcap_{i < j \le i+3} \{ |B_{j/n} - B_{(j-1)/n}| < 7 \frac{l}{n} \} . $$

Using Brownian scaling, we show that $P[B] = 0$ as follows:

$$ P[ \bigcap_{n \ge m} \bigcup_{0 < i \le n+1} \bigcap_{i < j \le i+3} \{ |B_{j/n} - B_{(j-1)/n}| < 7 \frac{l}{n} \} ] \le \liminf_{n \to \infty} n P[|B_{1/n}| < 7 \frac{l}{n}]^3 $$
$$ = \liminf_{n \to \infty} n P[|B_1| < 7 \frac{l}{\sqrt{n}}]^3 \le \lim_{n \to \infty} \frac{C}{\sqrt{n}} = 0 . $$

□


Remark. This proposition shows especially that we have no Lipschitz continuity of Brownian paths. A slight generalization shows that Brownian motion is not Hölder continuous for any $\alpha > 1/2$. One has just to do the same trick with $k$ instead of 3 steps, where $k(\alpha - 1/2) > 1$. The actual modulus of continuity is very near to $\alpha = 1/2$: $|B_t - B_{t+\epsilon}|$ is of the order

$$ h(\epsilon) = \sqrt{2 \epsilon \log(\frac{1}{\epsilon})} . $$

More precisely, $P[\limsup_{\epsilon \to 0} \sup_{|s-t| \le \epsilon} \frac{|B_s - B_t|}{h(\epsilon)} = 1] = 1$, as we will see later in theorem (4.4.2).

The covariance of standard Brownian motion was given by $\mathrm{E}[B_s B_t] = \min\{s,t\}$. We constructed it by implementing the Hilbert space $L^2([0,\infty))$ as a Gaussian subspace of $L^2(\Omega, \mathcal{A}, P)$. We look now at a more general class of Gaussian processes.


Definition. A function $V : T \times T \to \mathbb{R}$ is called positive semidefinite, if for all finite sets $\{t_1, \dots, t_n\} \subset T$, the matrix $V_{ij} = V(t_i, t_j)$ satisfies $(u, Vu) \ge 0$ for all vectors $u = (u_1, \dots, u_n)$.


Proposition 4.2.6. The covariance of a centered Gaussian process is positive semidefinite. Any positive semidefinite function $V$ on $T \times T$ is the covariance of a centered Gaussian process $X_t$.

Proof. The first statement follows from the fact that for all $u = (u_1, \dots, u_n)$

$$ \sum_{i,j} V(t_i, t_j) u_i u_j = \mathrm{E}[(\sum_{i=1}^{n} u_i X_{t_i})^2] \ge 0 . $$

We introduce for $t \in T$ a formal symbol $\delta_t$. Consider the vector space of finite sums $\sum_{i=1}^{n} a_i \delta_{t_i}$ with inner product

$$ (\sum_{i=1}^{n} a_i \delta_{t_i}, \sum_{j=1}^{n} b_j \delta_{t_j}) = \sum_{i,j} a_i b_j V(t_i, t_j) . $$

This is a positive semidefinite inner product. Factoring out the null vectors $\{ \|v\| = 0 \}$ and doing a completion gives a separable Hilbert space $H$. Define now as in the construction of Brownian motion the process $X_t = X(\delta_t)$. Because the map $X : H \to L^2$ preserves the inner product, we have

$$ \mathrm{E}[X_t X_s] = (\delta_s, \delta_t) = V(s,t) . $$

□

Let's look at some examples of Gaussian processes:


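Example. The Ornstein-Uhlenbeck oscillator process $X_t$ is a one-dimensional process which is used to describe the quantum mechanical oscillator as we will see later. Let $T = \mathbb{R}^+$ and take the function $V(s,t) = \frac{1}{2} e^{-|t-s|}$ on $T \times T$. We first show that $V$ is positive semidefinite: The Fourier transform of $f(t) = e^{-|t|}$ is

$$ \frac{1}{2\pi} \int_{\mathbb{R}} e^{ikt} e^{-|t|} \, dt = \frac{1}{\pi(k^2 + 1)} . $$

By Fourier inversion, we get

$$ \frac{1}{2\pi} \int_{\mathbb{R}} (k^2 + 1)^{-1} e^{ik(t-s)} \, dk = \frac{1}{2} e^{-|t-s|} , $$

and so

$$ 0 \le (2\pi)^{-1} \int_{\mathbb{R}} (k^2 + 1)^{-1} |\sum_j u_j e^{ikt_j}|^2 \, dk = \sum_{j,k=1}^{n} u_j u_k \frac{1}{2} e^{-|t_j - t_k|} . $$

This process has a continuous modification because

$$ \mathrm{E}[(X_t - X_s)^2] = (e^{-|t-t|} + e^{-|s-s|} - 2 e^{-|t-s|})/2 = (1 - e^{-|t-s|}) \le |t - s| . $$

Since the increments are Gaussian, this gives $\mathrm{E}[(X_t - X_s)^4] = 3 \, \mathrm{E}[(X_t - X_s)^2]^2 \le 3 |t-s|^2$, so that Kolmogorov's criterion applies. The Ornstein-Uhlenbeck process is also called the oscillator process.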


Proposition 4.2.7. Brownian motion $B_t$ and the Ornstein-Uhlenbeck process $O_t$ are for $t \ge 0$ related by

$$ O_t = \frac{1}{\sqrt{2}} e^{-t} B_{e^{2t}} . $$

Proof. Denote by $O$ the Ornstein-Uhlenbeck process and let

$$ X_t = 2^{-1/2} e^{-t} B_{e^{2t}} . $$

We want to show that $X = O$. Both $X$ and $O$ are centered Gaussian processes with continuous paths. To verify that they are the same, we have to show that they have the same covariance. This is a computation:

$$ \mathrm{E}[X_t X_s] = \frac{1}{2} e^{-t} e^{-s} \min\{e^{2t}, e^{2s}\} = e^{-|s-t|}/2 = \mathrm{E}[O_t O_s] . $$

□
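The covariance computation and the positive semidefiniteness of $V(s,t) = \frac{1}{2}e^{-|t-s|}$ can be checked numerically on a finite grid. The grid, the function names and the eigenvalue tolerance below are illustrative assumptions; a finite grid can of course only illustrate, not prove, positive semidefiniteness.

```python
import numpy as np

def ou_cov(s, t):
    """Ornstein-Uhlenbeck covariance V(s, t) = exp(-|t - s|) / 2."""
    return 0.5 * np.exp(-np.abs(t - s))

def time_changed_cov(s, t):
    """Covariance of 2^{-1/2} e^{-t} B_{e^{2t}}, using E[B_u B_v] = min(u, v)."""
    return 0.5 * np.exp(-s - t) * np.minimum(np.exp(2 * s), np.exp(2 * t))

grid = np.linspace(0.0, 3.0, 13)
S, T = np.meshgrid(grid, grid, indexing="ij")
V = ou_cov(S, T)
eigs = np.linalg.eigvalsh(V)      # eigenvalues of the symmetric matrix V(t_i, t_j)
```

The two covariance functions agree on the grid, and all entries of `eigs` are nonnegative up to rounding.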


It follows from this relation that also the Ornstein-Uhlenbeck process is almost surely nowhere differentiable. There are also generalized Ornstein-Uhlenbeck processes. The case $V(s,t) = \int_{\mathbb{R}} e^{ik(t-s)} \, d\mu(k) = \hat{\mu}(t - s)$ with the Cauchy measure $\mu = \frac{1}{2\pi(k^2+1)} \, dk$ on $\mathbb{R}$ can be generalized to take any finite symmetric measure $\mu$ on $\mathbb{R}$ and let $\hat{\mu}$ denote its Fourier transform $\int_{\mathbb{R}} e^{ikt} \, d\mu(k)$. The same calculation as above shows that the function $V(s,t) = \hat{\mu}(t - s)$ is positive semidefinite.

Figure. Three paths of the Ornstein-Uhlenbeck process.


Example. Brownian bridge is a one-dimensional process with time $T = [0,1]$ and $V(s,t) = s(1-t)$ for $0 \le s \le t \le 1$ and $V(s,t) = V(t,s)$ else. It is also called tied down process.

In order to show that $V$ is positive semidefinite, one observes that $X_s = B_s - s B_1$ is a Gaussian process, which has for $s \le t$ the covariance

$$ \mathrm{E}[X_s X_t] = \mathrm{E}[(B_s - s B_1)(B_t - t B_1)] = s + st - 2st = s(1-t) . $$

Since $\mathrm{E}[X_1^2] = 0$, we have $X_1 = 0$ which means that all paths start from 0 at time 0 and end at 0 at time 1.

The realization $X_s = B_s - s B_1$ shows also that $X_t$ has a continuous realization.

Figure. Three paths of Brownian bridge.
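The covariance of the realization $X_s = B_s - sB_1$ can be computed on a grid by applying the linear map $B \mapsto X$ to the covariance matrix $\min\{s,t\}$ of Brownian motion. The grid and variable names are assumptions made for this sketch.

```python
import numpy as np

grid = np.linspace(0.0, 1.0, 11)             # the last grid point is t = 1
M = np.minimum.outer(grid, grid)             # E[B_s B_t] = min(s, t)

# Row i of A encodes X_{s_i} = B_{s_i} - s_i * B_1 as a linear combination
# of the Brownian values (B_{s_0}, ..., B_{s_10}).
A = np.eye(len(grid))
A[:, -1] -= grid

bridge_cov = A @ M @ A.T                     # covariance of (X_{s_0}, ..., X_{s_10})
expected = np.minimum.outer(grid, grid) - np.outer(grid, grid)   # equals s(1 - t) for s <= t
```

The matrix `bridge_cov` matches `expected`, and its first and last diagonal entries vanish, reflecting $X_0 = X_1 = 0$.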


Let $X_t$ be the Brownian bridge and let $y$ be a point in $\mathbb{R}^d$. We can consider the Gaussian process $Y_t = ty + X_t$ which describes paths going from 0 at time 0 to $y$ at time 1. The process $Y$ has however no more zero mean. Brownian motion $B$ and Brownian bridge $X$ are related to each other by the formulas:

$$ B_t = \tilde{B}_t := (t+1) X_{t/(t+1)}, \quad X_t = \tilde{X}_t := (1-t) B_{t/(1-t)} . $$

These identities follow from the fact that both are continuous centered Gaussian processes with the right covariance:

$$ \mathrm{E}[\tilde{B}_s \tilde{B}_t] = (t+1)(s+1) \min\{ \frac{s}{s+1}, \frac{t}{t+1} \} (1 - \max\{ \frac{s}{s+1}, \frac{t}{t+1} \}) = \min\{s,t\} = \mathrm{E}[B_s B_t] , $$

and for $s \le t$

$$ \mathrm{E}[\tilde{X}_s \tilde{X}_t] = (1-t)(1-s) \min\{ \frac{s}{1-s}, \frac{t}{1-t} \} = s(1-t) = \mathrm{E}[X_s X_t] , $$

and uniqueness of Brownian motion.


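Example. If $V(s,t) = 1_{\{s=t\}}$, we get a Gaussian process which has the property that $X_s$ and $X_t$ are independent, if $s \ne t$. Especially, there is no autocorrelation between different $X_s$ and $X_t$. This process is called white noise or "great disorder". It can not be modified so that $(t,\omega) \mapsto X_t(\omega)$ is measurable: if $(t,\omega) \mapsto X_t(\omega)$ were measurable, then $Y_t = \int_0^t X_s \, ds$ would be measurable too. But then

$$ \mathrm{E}[Y_t^2] = \mathrm{E}[(\int_0^t X_s \, ds)^2] = \int_0^t \int_0^t \mathrm{E}[X_s X_{s'}] \, ds' \, ds = 0 $$

which implies $Y_t = 0$ almost everywhere so that the measure $d\mu(\omega) = X_s(\omega) \, ds$ is zero for almost all $\omega$. This leads to the contradiction

$$ t = \mathrm{E}[\int_0^t X_s^2 \, ds] = \mathrm{E}[\int_0^t X_s X_s \, ds] = \mathrm{E}[\int_0^t X_s \, d\mu(s)] = 0 . $$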


In a distributional sense, one can see Brownian motion as a solution of the stochastic differential equation and white noise as a generalized mean-square derivative of Brownian motion. We will look at stochastic differential equations later.

Example. Brownian sheet is not a stochastic process with one dimensional time but a random field: time $T = \mathbb{R}_+^2$ is two dimensional. Actually, as long as we deal only with Gaussian random variables and do not want to tackle regularity questions, the time $T$ can be quite arbitrary and proposition (4.2.6) stated at the beginning of this section holds true. The Gaussian process with

$$ V((s_1, s_2), (t_1, t_2)) = \min(s_1, t_1) \cdot \min(s_2, t_2) $$

is called Brownian sheet. It has similar scaling properties as Brownian motion.
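A discretised Brownian sheet can be sketched as a double cumulative sum of independent Gaussian cells, in analogy with the Gaussian measure construction. The grid $(i/n, j/n)$, the function names and the row-major flattening are assumptions of this illustration; the second function computes the exact covariance of the discretised field so that it can be compared with the product formula.

```python
import numpy as np

def sheet_sample(n, rng):
    """Sheet values on the grid (i/n, j/n), 1 <= i, j <= n, from independent cells."""
    cells = rng.normal(0.0, 1.0 / n, size=(n, n))   # standard deviation 1/n, variance 1/n^2
    return cells.cumsum(axis=0).cumsum(axis=1)

def sheet_covariance(n):
    """Exact covariance of the discretised sheet, flattened in row-major order."""
    L = np.tril(np.ones((n, n)))
    L2 = np.kron(L, L)                # maps flattened cells to flattened sheet values
    return (L2 @ L2.T) / n ** 2

n = 4
times = np.arange(1, n + 1) / n
M = np.minimum.outer(times, times)
expected = np.kron(M, M)              # min(s1, t1) * min(s2, t2) in the same ordering
sample = sheet_sample(n, np.random.default_rng(2))
```

Here `sheet_covariance(n)` reproduces `expected`, the grid version of $V((s_1,s_2),(t_1,t_2)) = \min(s_1,t_1)\cdot\min(s_2,t_2)$.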


Figure. Illustrating a sample of a Brownian sheet $B_{t,s}$. Time is two dimensional. Every trace $B_t = B_{t,s_0}$ or $B_s = B_{t_0,s}$ is standard Brownian motion.


4.3 The Wiener measure 


Let $(E, \mathcal{E})$ be a measurable space and let $T$ be a set called "time". A stochastic process $X$ on a probability space $(\Omega, \mathcal{A}, P)$ indexed by $T$ and with values in $E$ defines a map

$$ \phi : \Omega \to E^T, \quad \omega \mapsto X_t(\omega) . $$

The product space $E^T$ is equipped with the product $\sigma$-algebra $\mathcal{E}^T$, which is the smallest $\sigma$-algebra for which all the coordinate functions $w \mapsto w(t)$ are measurable. It is the $\sigma$-algebra generated by the $\pi$-system

$$ \{ \prod_{t_1, \dots, t_n} A_{t_i} = \{ x \in E^T, x_{t_i} \in A_{t_i} \} \mid A_{t_i} \in \mathcal{E} \} $$

consisting of cylinder sets. Denote by $Y_t(w) = w(t)$ the coordinate maps on $E^T$. Because $Y_t \circ \phi$ is measurable for all $t$, also $\phi$ is measurable. Denote by $P_X$ the push-forward measure of $\phi$ from $(\Omega, \mathcal{A}, P)$ to $(E^T, \mathcal{E}^T)$ defined by $P_X[A] = P[\phi^{-1}(A)]$. For any finite set $(t_1, \dots, t_n) \subset T$ and all sets $A_i \in \mathcal{E}$, we have

$$ P[X_{t_i} \in A_i, i = 1, \dots, n] = P_X[Y_{t_i} \in A_i, i = 1, \dots, n] . $$

One says, the two processes $X$ and $Y$ are versions of each other.


Definition. $Y$ is called the coordinate process of $X$ and the probability measure $P_X$ is called the law of $X$.

Definition. Two processes $X, X'$ possibly defined on different probability spaces are called versions of each other if they have the same law $P_X = P_{X'}$.

One usually does not work with the coordinate process but prefers to work with processes which have some continuity properties. Many processes have versions which are right continuous and have left hand limits at every point.

Definition. Let $D$ be a measurable subset of $E^T$ and assume the process has a version $X$ such that almost all paths $X(\omega)$ are in $D$. Define the probability space $(D, \mathcal{E}^T \cap D, Q)$, where $Q$ is the measure $Q = \phi^* P$. Obviously, the process $Y$ defined on $(D, \mathcal{E}^T \cap D, Q)$ is another version of $X$. If $D$ is the set of paths which are right continuous with left hand limits, the process is called the canonical version of $X$.


Corollary 4.3.1. Let $E = \mathbb{R}^d$ and $T = \mathbb{R}^+$. There exists a unique probability measure $W$ on $C(T, E)$ for which the coordinate process $Y$ is the Brownian motion $B$.

Proof. Let $D = C(T, E) \subset E^T$. Define the measure $W = \phi^* P_X$ and let $Y$ be the coordinate process of $B$. Uniqueness: assume we have two such measures $W, W'$ and let $Y, Y'$ be the coordinate processes of $B$ on $D$ with respect to $W$ and $W'$. Since both $Y$ and $Y'$ are versions of $B$ and "being a version" is an equivalence relation, they are also versions of each other. This means that $W$ and $W'$ coincide on a $\pi$-system and are therefore the same. □


Definition. If $E = \mathbb{R}^d$ and $T = [0,\infty)$, the measure $W$ on $C(T, E)$ is called the Wiener measure. The probability space $(C(T,E), \mathcal{E}^T \cap C(T,E), W)$ is called the Wiener space.

Let $\mathcal{B}'$ be the $\sigma$-algebra $\mathcal{E}^T \cap C(T,E)$, which is the product $\sigma$-algebra restricted to $C(T,E)$. The space $C(T,E)$ carries another $\sigma$-algebra, namely the Borel $\sigma$-algebra $\mathcal{B}$ generated by its own topology. We have $\mathcal{B} \subset \mathcal{B}'$, since all closed balls $\{ f \in C(T,E) \mid |f - f_0| \le r \} \in \mathcal{B}$ are in $\mathcal{B}'$. The other relation $\mathcal{B}' \subset \mathcal{B}$ is clear so that $\mathcal{B} = \mathcal{B}'$. The Wiener measure is therefore a Borel measure.


Remark. The Wiener measure can also be constructed without Brownian motion and can be used to define Brownian motion. We sketch the idea. Let $S = \dot{\mathbb{R}}^n$ denote the one point compactification of $\mathbb{R}^n$. Define $\Omega = S^{[0,t]}$ to be the set of functions from $[0,t]$ to $S$, which is also the set of paths in $\dot{\mathbb{R}}^n$. It is by Tychonov a compact space with the product topology. Define

$$ C_{fin}(\Omega) = \{ \phi \in C(\Omega, \mathbb{R}) \mid \exists F : \mathbb{R}^n \to \mathbb{R}, \; \phi(\omega) = F(\omega(t_1), \dots, \omega(t_n)) \} . $$

Define also the Gauss kernel $p(x,y,t) = (4\pi t)^{-n/2} \exp(-|x-y|^2/4t)$. Define on $C_{fin}(\Omega)$ the functional

$$ (L\phi)(s_1, \dots, s_m) = \int_{(\mathbb{R}^n)^m} F(x_1, x_2, \dots, x_m) \, p(0, x_1, s_1) \, p(x_1, x_2, s_2) \cdots p(x_{m-1}, x_m, s_m) \, dx_1 \cdots dx_m $$

with $s_1 = t_1$ and $s_k = t_k - t_{k-1}$ for $k \ge 2$. Since $|L(\phi)| \le \|\phi\|_{\infty}$, it is a bounded linear functional on the dense linear subspace $C_{fin}(\Omega) \subset C(\Omega)$. It is nonnegative and $L(1) = 1$. By the Hahn Banach theorem, it extends uniquely to a bounded linear functional on $C(\Omega)$. By the Riesz representation theorem, there exists a unique measure $\mu$ on $\Omega$ such that $L(\phi) = \int \phi(\omega) \, d\mu(\omega)$. This is the Wiener measure on $\Omega$.


4.4 Lévy’s modulus of continuity 


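We start with an elementary estimate.


Lemma 4.4.1. For any $a > 0$,

$$ \frac{a}{a^2 + 1} e^{-a^2/2} < \int_a^{\infty} e^{-x^2/2} \, dx < \frac{1}{a} e^{-a^2/2} . $$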
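The two-sided Gaussian tail estimate used in this section, namely $\frac{a}{a^2+1}e^{-a^2/2} < \int_a^\infty e^{-x^2/2}\,dx < \frac{1}{a}e^{-a^2/2}$ for $a > 0$ (the bounds that the proof below establishes), can be sanity-checked numerically. The sketch expresses the tail integral through the complementary error function from the standard library; the function names are our own.

```python
import math

def gaussian_tail(a):
    """Integral of exp(-x^2/2) over [a, infinity), via the complementary error function."""
    return math.sqrt(math.pi / 2.0) * math.erfc(a / math.sqrt(2.0))

def tail_bounds(a):
    """Lower bound a/(1+a^2) e^{-a^2/2} and upper bound (1/a) e^{-a^2/2}, for a > 0."""
    g = math.exp(-a * a / 2.0)
    return a / (1.0 + a * a) * g, g / a

# Print lower bound, tail integral and upper bound for a few sample values of a.
for a in (0.1, 0.5, 1.0, 2.0, 5.0, 10.0):
    lower, upper = tail_bounds(a)
    print(a, lower, gaussian_tail(a), upper)
```

For large $a$ the two bounds differ only by the factor $a^2/(a^2+1)$, which is why the estimate is sharp enough for the Borel-Cantelli arguments that follow.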


Proof. For the right inequality,

$$ \int_a^{\infty} e^{-x^2/2} \, dx < \int_a^{\infty} e^{-x^2/2} (x/a) \, dx = \frac{1}{a} e^{-a^2/2} . $$

For the left inequality consider

$$ \int_a^{\infty} \frac{1}{b^2} e^{-b^2/2} \, db < \frac{1}{a^2} \int_a^{\infty} e^{-x^2/2} \, dx . $$

Integrating by parts of the left hand side of this gives

$$ \frac{1}{a} e^{-a^2/2} - \int_a^{\infty} e^{-x^2/2} \, dx < \frac{1}{a^2} \int_a^{\infty} e^{-x^2/2} \, dx . $$

□

Theorem 4.4.2 (Lévy's modulus of continuity). If $B$ is standard Brownian motion, then

$$ P[\limsup_{\epsilon \to 0} \sup_{|s-t| \le \epsilon} \frac{|B_s - B_t|}{h(\epsilon)} = 1] = 1 , $$

where $h(\epsilon) = \sqrt{2 \epsilon \log(1/\epsilon)}$.

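Proof. We follow [83]:

(i) Proof of the inequality "$\ge 1$".
Take $0 < \delta < 1$. Define $a_n = (1 - \delta) h(2^{-n}) = (1 - \delta) 2^{-n/2} \sqrt{2 n \log 2}$ and the normalized threshold $b_n = 2^{n/2} a_n = (1 - \delta) \sqrt{2 n \log 2}$. Consider

$$ P[A_n] = P[\max_{1 \le k \le 2^n} |B_{k2^{-n}} - B_{(k-1)2^{-n}}| \le a_n] . $$

Because the increments $B_{k/2^n} - B_{(k-1)/2^n}$ are independent Gaussian random variables with variance $2^{-n}$, we compute, using the above lemma (4.4.1) and $1 - s < e^{-s}$,

$$ P[A_n] = (1 - 2 \int_{b_n}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \, dx)^{2^n} \le (1 - \frac{2}{\sqrt{2\pi}} \frac{b_n}{b_n^2 + 1} e^{-b_n^2/2})^{2^n} $$
$$ \le \exp(-2^n \frac{2}{\sqrt{2\pi}} \frac{b_n}{b_n^2 + 1} e^{-b_n^2/2}) \le \exp(-C \, 2^{n(1 - (1-\delta)^2)} / \sqrt{n}) , $$

where $C$ is a constant independent of $n$. Since $\sum_n P[A_n] < \infty$, we get by the first Borel-Cantelli lemma that $P[\limsup_n A_n] = 0$. Because $\delta$ was arbitrary, this gives

$$ P[\liminf_{n \to \infty} \max_{1 \le k \le 2^n} \frac{|B_{k2^{-n}} - B_{(k-1)2^{-n}}|}{h(2^{-n})} \ge 1] = 1 . $$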


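(ii) Proof of the inequality "$\le 1$".
Take again $0 < \delta < 1$ and pick $\epsilon > 0$ such that $(1 + \epsilon)(1 - \delta) > (1 + \delta)$. Define

$$ P[A_n] = P[\max_{k = j - i \in K} |B_{j2^{-n}} - B_{i2^{-n}}| / h(k2^{-n}) \ge (1 + \epsilon)] = P[\bigcup_{k = j - i \in K} \{ |B_{j2^{-n}} - B_{i2^{-n}}| \ge a_{n,k} \}] , $$

where

$$ K = \{ 0 < k \le 2^{n\delta} \} $$

and $a_{n,k} = h(k2^{-n})(1 + \epsilon)$, and where the maximum and the union run over all $0 \le i < j \le 2^n$ with $j - i \in K$. The increment $B_{j2^{-n}} - B_{i2^{-n}}$ is Gaussian with variance $k2^{-n}$, and $a_{n,k} / \sqrt{k2^{-n}} = (1 + \epsilon) \sqrt{2 \log(k^{-1} 2^n)}$. Using the above lemma and the fact that there are at most $2^n$ pairs $(i,j)$ for each $k$, we get with some constants $C$ which may vary from line to line:

$$ P[A_n] \le C \cdot 2^n \sum_{k \in K} \log(k^{-1} 2^n)^{-1/2} e^{-(1+\epsilon)^2 \log(k^{-1} 2^n)} $$
$$ \le C \cdot 2^n \, 2^{-n(1-\delta)(1+\epsilon)^2} \sum_{k \in K} (\log(k^{-1} 2^n))^{-1/2} \quad (\text{since } k^{-1} \ge 2^{-n\delta}) $$
$$ \le C \cdot n^{-1/2} \, 2^{n((1+\delta) - (1-\delta)(1+\epsilon)^2)} . $$

In the last step was used that there are at most $2^{n\delta}$ points in $K$ and for each of them $\log(k^{-1} 2^n) \ge \log(2^{n(1-\delta)})$. Because $(1-\delta)(1+\epsilon)^2 > (1+\delta)(1+\epsilon) > 1 + \delta$, we see that $\sum_n P[A_n]$ converges. By Borel-Cantelli we get for almost every $\omega$ an integer $n(\omega)$ such that for $n > n(\omega)$

$$ |B_{j2^{-n}} - B_{i2^{-n}}| < (1 + \epsilon) \cdot h(k2^{-n}) , $$

where $k = j - i \in K$. Increase possibly $n(\omega)$ so that for $n > n(\omega)$

$$ \sum_{m > n} h(2^{-m}) < \epsilon \cdot h(2^{-(n+1)(1-\delta)}) . $$

Pick $0 \le t_1 < t_2 \le 1$ such that $t = t_2 - t_1 < 2^{-n(\omega)(1-\delta)}$. Take next $n > n(\omega)$ such that $2^{-(n+1)(1-\delta)} \le t < 2^{-n(1-\delta)}$ and write the dyadic development of $t_1, t_2$:

$$ t_1 = i2^{-n} - 2^{-p_1} - 2^{-p_2} - \dots, \quad t_2 = j2^{-n} + 2^{-q_1} + 2^{-q_2} + \dots $$

with $t_1 \le i2^{-n} < j2^{-n} \le t_2$ and $0 < k = j - i \le t 2^n < 2^{n\delta}$. We get

$$ |B_{t_1}(\omega) - B_{t_2}(\omega)| \le |B_{t_1}(\omega) - B_{i2^{-n}}(\omega)| + |B_{i2^{-n}}(\omega) - B_{j2^{-n}}(\omega)| + |B_{j2^{-n}}(\omega) - B_{t_2}(\omega)| $$
$$ \le 2 \sum_{p > n} (1 + \epsilon) h(2^{-p}) + (1 + \epsilon) h(k2^{-n}) \le (1 + 3\epsilon + 2\epsilon^2) h(t) . $$

Because $\epsilon > 0$ was arbitrary, the proof is complete. □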


4.5 Stopping times 


Stopping times are useful for the construction of new processes, in proofs
of inequalities and convergence theorems as well as in the study of return
time results. A good source for stopping time results and stochastic processes
in general is [83].


Definition. A filtration of a measurable space $(\Omega, \mathcal{A})$ is an increasing family $(\mathcal{A}_t)_{t \ge 0}$ of sub-$\sigma$-algebras of $\mathcal{A}$. A measurable space endowed with a filtration $(\mathcal{A}_t)_{t \ge 0}$ is called a filtered space. A process $X$ is called adapted to the filtration $\mathcal{A}_t$, if $X_t$ is $\mathcal{A}_t$-measurable for all $t$.

Definition. A process $X$ on $(\Omega, \mathcal{A}, P)$ defines a natural filtration $\mathcal{A}_t = \sigma(X_s \mid s \le t)$, the minimal filtration of $X$ for which $X$ is adapted. Heuristically, $\mathcal{A}_t$ is the set of events which may occur up to time $t$.


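Definition. With a filtration we can associate two other filtrations by setting for $t > 0$

$$ \mathcal{A}_{t-} = \sigma(\mathcal{A}_s, s < t), \quad \mathcal{A}_{t+} = \bigcap_{s > t} \mathcal{A}_s . $$

For $t = 0$ we can still define $\mathcal{A}_{0+} = \bigcap_{s > 0} \mathcal{A}_s$ and define $\mathcal{A}_{0-} = \mathcal{A}_0$. Define also $\mathcal{A}_{\infty} = \sigma(\mathcal{A}_s, s \ge 0)$.

Remark. We always have $\mathcal{A}_{t-} \subset \mathcal{A}_t \subset \mathcal{A}_{t+}$ and both inclusions can be strict.

Definition. If $\mathcal{A}_t = \mathcal{A}_{t+}$ then the filtration $\mathcal{A}_t$ is called right continuous. If $\mathcal{A}_t = \mathcal{A}_{t-}$, then $\mathcal{A}_t$ is left continuous. As an example, the filtration $\mathcal{A}_{t+}$ of any filtration is right continuous.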


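Definition. A stopping time relative to a filtration $\mathcal{A}_t$ is a map $T : \Omega \to [0,\infty]$ such that $\{T \le t\} \in \mathcal{A}_t$.

Remark. If $\mathcal{A}_t$ is right continuous, then $T$ is a stopping time if and only if $\{T < t\} \in \mathcal{A}_t$. Also $T$ is a stopping time if and only if $X_t = 1_{(0,T]}(t)$ is adapted. $X$ is then a left continuous adapted process.

Definition. If $T$ is a stopping time, define

$$ \mathcal{A}_T = \{ A \in \mathcal{A}_{\infty} \mid A \cap \{T \le t\} \in \mathcal{A}_t, \forall t \} . $$

It is a $\sigma$-algebra. As an example, if $T = s$ is constant, then $\mathcal{A}_T = \mathcal{A}_s$. Note also that

$$ \mathcal{A}_{T+} = \{ A \in \mathcal{A}_{\infty} \mid A \cap \{T < t\} \in \mathcal{A}_t, \forall t \} . $$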


We give examples of stopping times.

Proposition 4.5.1. Let $X$ be the coordinate process on $C(\mathbb{R}_+, E)$, where $E$ is a metric space. Let $A$ be a closed set in $E$. Then the so called entry time

$$ T_A(\omega) = \inf\{ t \ge 0 \mid X_t(\omega) \in A \} $$

is a stopping time relative to the filtration $\mathcal{A}_t = \sigma(\{X_s\}_{s \le t})$.

Proof. Let $d$ be the metric on $E$. We have

$$ \{T_A \le t\} = \{ \inf_{s \in \mathbb{Q}, s \le t} d(X_s(\omega), A) = 0 \} $$

which is in $\mathcal{A}_t = \sigma(X_s, s \le t)$. □


Proposition 4.5.2. Let $X$ be the coordinate process on $D(\mathbb{R}_+, E)$, the space of right continuous functions, where $E$ is a metric space. Let $A$ be an open subset of $E$. Then the hitting time

$$ T_A(\omega) = \inf\{ t > 0 \mid X_t(\omega) \in A \} $$

is a stopping time with respect to the filtration $\mathcal{A}_{t+}$.

Proof. $T_A$ is an $\mathcal{A}_{t+}$ stopping time if and only if $\{T_A < t\} \in \mathcal{A}_t$ for all $t$. If $A$ is open and $X_s(\omega) \in A$, we know by the right-continuity of the paths that $X_t(\omega) \in A$ for every $t \in [s, s + \epsilon)$ for some $\epsilon > 0$. Therefore

$$ \{T_A < t\} = \bigcup_{s \in \mathbb{Q}, s < t} \{ X_s \in A \} \in \mathcal{A}_t . $$

□


Definition. Let $\mathcal{A}_t$ be a filtration on $(\Omega, \mathcal{A})$ and let $T$ be a stopping time. For a process $X$, we define a new map $X_T$ on the set $\{T < \infty\}$ by

$$ X_T(\omega) = X_{T(\omega)}(\omega) . $$

Remark. We have met this definition already in the case of discrete time but in the present situation, it is not clear whether $X_T$ is measurable. It turns out that this is true for many processes.


Definition. A process $X$ is called progressively measurable with respect to a filtration $\mathcal{A}_t$ if for all $t$, the map $(s, \omega) \mapsto X_s(\omega)$ from $([0,t] \times \Omega, \mathcal{B}([0,t]) \times \mathcal{A}_t)$ into $(E, \mathcal{E})$ is measurable.

A progressively measurable process is adapted. For some processes, the converse holds:

Lemma 4.5.3. An adapted process with right or left continuous paths is progressively measurable.


Proof. Assume right continuity (the argument is similar in the case of left continuity). Write $X$ as the coordinate process on $D([0,t], E)$. Denote the map $(s, \omega) \mapsto X_s(\omega)$ with $Y = Y(s, \omega)$. Given a closed ball $U \in \mathcal{E}$. We have to show that $Y^{-1}(U) = \{ (s, \omega) \mid Y(s, \omega) \in U \} \in \mathcal{B}([0,t]) \times \mathcal{A}_t$. Given $k \in \mathbb{N}$, we define $E_{0,U} = 0$ and inductively for $k \ge 1$ the $k$'th hitting time (a stopping time)

$$ H_{k,U}(\omega) = \inf\{ s \in \mathbb{Q} \mid E_{k-1,U}(\omega) < s < t, X_s \in U \} $$

as well as the $k$'th exit time (not necessarily a stopping time)

$$ E_{k,U}(\omega) = \inf\{ s \in \mathbb{Q} \mid H_{k,U}(\omega) < s < t, X_s \notin U \} . $$

These are countably many measurable maps from $D([0,t], E)$ to $[0,t]$. Then by the right-continuity

$$ Y^{-1}(U) = \bigcup_{k=1}^{\infty} \{ (s, \omega) \mid H_{k,U}(\omega) \le s \le E_{k,U}(\omega) \} $$

which is in $\mathcal{B}([0,t]) \times \mathcal{A}_t$. □


Proposition 4.5.4. If $X$ is progressively measurable and $T$ is a stopping time, then $X_T$ is $\mathcal{A}_T$-measurable on the set $\{T < \infty\}$.

Proof. The set $\{T < \infty\}$ is itself in $\mathcal{A}_T$. To say that $X_T$ is $\mathcal{A}_T$-measurable on this set is equivalent with $X_T \cdot 1_{\{T \le t\}} \in \mathcal{A}_t$ for every $t$. But the map

$$ T : (\{T \le t\}, \mathcal{A}_t \cap \{T \le t\}) \to ([0,t], \mathcal{B}[0,t]) $$

is measurable because $T$ is a stopping time. This means that the map $\omega \mapsto (T(\omega), \omega)$ from $(\Omega, \mathcal{A}_t)$ to $([0,t] \times \Omega, \mathcal{B}([0,t]) \times \mathcal{A}_t)$ is measurable and $X_T$ is the composition of this map with $X$, which is $\mathcal{B}[0,t] \times \mathcal{A}_t$ measurable by hypothesis. □


Definition. Given a stopping time $T$ and a process $X$, we define the stopped process $(X^T)_t(\omega) = X_{T \wedge t}(\omega)$.

Remark. If $\mathcal{A}_t$ is a filtration then $\mathcal{A}_{t \wedge T}$ is a filtration since if $T_1$ and $T_2$ are stopping times, then $T_1 \wedge T_2$ is a stopping time.


Corollary 4.5.5. If $X$ is progressively measurable with respect to $\mathcal{A}_t$ and $T$ is a stopping time, then $(X^T)_t = X_{t \wedge T}$ is progressively measurable with respect to $\mathcal{A}_{t \wedge T}$.


Proof. Because tA T is a stopping time, we have from the previous propo- 
sition that X7 is A:,7 measurable. 

We know by assumption that ¢ : (s,w) ++ X.(w) is measurable. Since also 
w : (s,w) ++ (s AT)(w) is measurable, we know also that the composition 
(s,w) > Xr(w) = Xyo,u)(w) = o((s,w),w) is measurable. O 

Proposition 4.5.6. Every stopping time is the decreasing limit of a sequence
of stopping times taking only finitely many values.


Proof. Given a stopping time T, define the discretisation $T_k = +\infty$ if $T \geq k$
and $T_k = q 2^{-k}$ if $(q-1) 2^{-k} \leq T < q 2^{-k}$, $q \leq 2^k k$. Each $T_k$ is a stopping
time and $T_k$ decreases to T. □
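
The dyadic discretisation in this proof can be tried out on a single value of T. The following is a minimal sketch, assuming the convention $T_k = q 2^{-k}$ for $(q-1)2^{-k} \leq T < q 2^{-k}$; the function name `discretise` is an illustrative choice and not part of the text.

```python
import math

def discretise(T, k):
    """Dyadic upper approximation T_k of one value T of a stopping time.

    Returns +inf if T >= k, otherwise q * 2**-k with
    (q - 1) * 2**-k <= T < q * 2**-k.
    """
    if T >= k:
        return math.inf
    q = math.floor(T * 2**k) + 1
    return q / 2**k

T = 0.3
values = [discretise(T, k) for k in range(1, 8)]
for k, v in enumerate(values, start=1):
    print(k, v)
```

Each $T_k$ takes values in a finite grid (or is infinite), lies strictly above T, and the sequence is non-increasing in k, which is what the proposition uses.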


Many concepts of classical potential theory can be expressed in an elegant 
form in a probabilistic language. We give very briefly some examples with- 
out proofs, but some hints to the literature. 


Let $B_t$ be Brownian motion in $\mathbb{R}^d$ and $T_A$ the hitting time of a set $A \subset \mathbb{R}^d$.
Let D be a domain in $\mathbb{R}^d$ with boundary $\delta(D)$ such that the Green function
$G(x,y)$ exists in D. Such a domain is then called a Green domain.


Definition. The Green function of a domain D is defined as the fundamental
solution satisfying $\Delta G(x,y) = \delta(x-y)$, where $\delta(x-y)$ is the Dirac measure
at $y \in D$. Having the fundamental solution G, we can solve the Poisson
equation $\Delta u = v$ for a given function v by

$$ u(x) = \int_D G(x,y) \cdot v(y) \; dy . $$


The Green function can be computed using Brownian motion as follows:

$$ G(x,y) = \int_0^{\infty} g(t,x,y) \; dt , $$

where for $x \in D$,

$$ \int_C g(t,x,y) \; dy = P_x[B_t \in C, \; T_{\delta D} > t] $$

and $P_x$ is the Wiener measure of $B_t$ starting at the point x.


We can interpret that as follows. To determine $G(x,y)$, consider the killed
Brownian motion $B_t$ starting at x, where T is the hitting time of the boundary.
$G(x,y)$ is then the probability density of the particles described by the
Brownian motion.


Definition. The classical Dirichlet problem for a bounded Green domain
$D \subset \mathbb{R}^d$ with boundary $\delta D$ is to find for a given function $f \in C(\delta D)$, a
solution $u \in C(\overline{D})$ such that $\Delta u = 0$ inside D and

$$ \lim_{x \to y, \, x \in D} u(x) = f(y) $$

for every $y \in \delta D$.

This problem can not be solved in general even for domains with piecewise
smooth boundaries if $d \geq 3$.


Definition. The following example, called the Lebesgue thorn or Lebesgue
spine, was suggested by Lebesgue in 1913. Let D be the inside of a
spherical chamber into which a thorn is punched. The boundary $\delta D$ is
held at constant temperature f, where f = 1 at the tip of the thorn y
and zero except in a small neighborhood of y. The temperature u inside
D is a solution of the Dirichlet problem $\Delta u = 0$ satisfying the boundary
condition u = f on the boundary $\delta D$. But the heat radiated from the thorn
is proportional to its surface area. If the tip is sharp enough, a person sitting
in the chamber will be cold, no matter how close to the heater. This means
$\liminf_{x \to y, \, x \in D} u(x) < 1 = f(y)$. (For more details, see [43, 46].)


Because of this problem, one has to modify the question and one says, u is
a solution of a modified Dirichlet problem, if u satisfies $\Delta u = 0$ inside D
and $\lim_{x \to y, \, x \in D} u(x) = f(y)$ for all nonsingular points y in the boundary
$\delta D$. Irregularity of a point y can be defined analytically but it is equivalent
with $P_y[T_{D^c} > 0] = 1$, which means that almost every Brownian particle
starting at $y \in \delta D$ will return to $\delta D$ after positive time.


Theorem 4.5.7 (Kakutani 1944). The solution of the regularized Dirichlet
problem can be expressed with Brownian motion $B_t$ and the hitting time
T of the boundary:

$$ u(x) = E_x[f(B_T)] . $$


In words, the solution u(x) of the Dirichlet problem is the expected value
of the boundary function f at the exit point $B_T$ of Brownian motion $B_t$
starting at x. We have seen in the previous chapter that the discretized
version of this result on a graph is quite easy to prove.


Figure. To solve the Dirichlet problem in a bounded domain with Brownian
motion, start the process at the point x and run it until it reaches the
boundary at $B_T$, then compute $f(B_T)$ and average this random variable
over all paths ω.
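
The recipe in the figure can be sketched numerically. The following is a rough Monte Carlo illustration and not the construction used in the text: it assumes the unit disk in the plane as domain, replaces Brownian motion by a Gaussian random walk with a small time step `dt`, and projects the first point outside the disk radially onto the boundary. The function name and all parameters are illustrative choices.

```python
import math
import random

def solve_dirichlet_disk(x0, y0, f, n_paths=3000, dt=0.005, seed=1):
    """Estimate u(x0, y0) = E_x[f(B_T)] for the unit disk.

    Brownian motion is approximated by a Gaussian random walk with
    time step dt; the exit point is projected onto the unit circle.
    """
    rng = random.Random(seed)
    sd = math.sqrt(dt)
    total = 0.0
    for _ in range(n_paths):
        x, y = x0, y0
        while x * x + y * y < 1.0:
            x += rng.gauss(0.0, sd)
            y += rng.gauss(0.0, sd)
        r = math.hypot(x, y)
        total += f(x / r, y / r)  # boundary value at the (projected) exit point
    return total / n_paths

# Boundary data f(x, y) = x is the restriction of the harmonic function
# u(x, y) = x, so the estimate at (0.3, 0.2) should be close to 0.3.
u = solve_dirichlet_disk(0.3, 0.2, lambda x, y: x)
print(u)
```

The time discretisation introduces a small bias near the boundary, so the estimate agrees with the exact solution only up to the statistical error and an error of the order of the step size.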

Remark. Ikeda has discovered that there exists also a probabilistic method
for solving the classical Neumann problem in the case d = 2. For more
information about this, one can consult [43, 79]. The process for the
Neumann problem is not the process of killed Brownian motion, but the
process of reflected Brownian motion.


Remark. Let Δ be the Dirichlet Laplacian of a bounded domain D. One
can compute the heat flow $e^{-t\Delta}u$ by the following formula

$$ (e^{-t\Delta}u)(x) = E_x[u(B_t); t < T] , $$

where T is the hitting time of $\delta D$ for Brownian motion $B_t$ starting at x.


Remark. Let K be a compact subset of a Green domain D. The hitting
probability

$$ p(x) = P_x[T_K < T_{\delta D}] $$

is the equilibrium potential of K relative to D. We give a definition of the
equilibrium potential later. Physically, the equilibrium potential is obtained
by measuring the electrostatic potential, if one is grounding the conducting
boundary and charging the conducting set K with a unit amount of charge.


4.6 Continuous time martingales 


Definition. Given a filtration $\mathcal{A}_t$ of the probability space $(\Omega, \mathcal{A}, P)$. A real-valued
process $X_t \in \mathcal{L}^1$ which is $\mathcal{A}_t$ adapted is called a submartingale, if
$E[X_t | \mathcal{A}_s] \geq X_s$ for all $s \leq t$, it is called a supermartingale if −X is a submartingale
and a martingale, if it is both a super and sub-martingale. If additionally
$X_t \in \mathcal{L}^p$ for all t, we speak of $\mathcal{L}^p$ super or sub-martingales.


We have seen martingales for discrete time already in the last chapter. 
Brownian motion gives examples with continuous time. 


Proposition 4.6.1. Let $B_t$ be standard Brownian motion. Then $B_t$, $B_t^2 - t$
and $e^{\alpha B_t - \alpha^2 t/2}$ are martingales.


Proof. $B_t - B_s$ is independent of $B_s$. Therefore

$$ E[B_t | \mathcal{A}_s] - B_s = E[B_t - B_s | \mathcal{A}_s] = E[B_t - B_s] = 0 . $$

Since by the "extracting knowledge" property

$$ E[B_t B_s | \mathcal{A}_s] = B_s \cdot E[B_t | \mathcal{A}_s] = B_s^2 , $$

we get

$$ E[B_t^2 - t | \mathcal{A}_s] - (B_s^2 - s) = E[B_t^2 - B_s^2 | \mathcal{A}_s] - (t - s) = E[(B_t - B_s)^2 | \mathcal{A}_s] - (t - s) = 0 . $$

Since Brownian motion begins at any time s new, we have

$$ E[e^{\alpha (B_t - B_s)} | \mathcal{A}_s] = E[e^{\alpha B_{t-s}}] = e^{\alpha^2 (t-s)/2} $$

from which

$$ E[e^{\alpha B_t} | \mathcal{A}_s] \, e^{-\alpha^2 t/2} = e^{\alpha B_s} e^{-\alpha^2 s/2} $$

follows. □
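
The martingale property of the three processes can be checked by simulation. The sketch below is only a sanity check under stated assumptions: it samples $B_s$ and the independent increment $B_t - B_s$ directly, and tests $E[(X_t - X_s) 1_C] \approx 0$ for the single event $C = \{B_s > 0\}$ in $\mathcal{A}_s$. The function name and the parameter values are illustrative.

```python
import math
import random

def martingale_defects(s=0.5, t=1.0, alpha=0.5, n=20000, seed=0):
    """Estimate E[(X_t - X_s) * 1{B_s > 0}] for X = B, B^2 - t and
    exp(alpha B - alpha^2 t / 2); each value should be close to 0."""
    rng = random.Random(seed)
    sums = [0.0, 0.0, 0.0]
    for _ in range(n):
        b_s = rng.gauss(0.0, math.sqrt(s))
        b_t = b_s + rng.gauss(0.0, math.sqrt(t - s))  # independent increment
        if b_s > 0:  # an event that is known at time s
            sums[0] += b_t - b_s
            sums[1] += (b_t ** 2 - t) - (b_s ** 2 - s)
            sums[2] += (math.exp(alpha * b_t - alpha ** 2 * t / 2)
                        - math.exp(alpha * b_s - alpha ** 2 * s / 2))
    return [v / n for v in sums]

defects = martingale_defects()
print(defects)
```

A single test event does not prove the martingale property, but a clearly nonzero value would contradict it.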


As in the discrete case, we remark: 


Proposition 4.6.2. If $X_t$ is a $\mathcal{L}^p$-martingale, then $|X_t|^p$ is a submartingale
for $p \geq 1$.


Proof. The conditional Jensen inequality gives

$$ E[|X_t|^p | \mathcal{A}_s] \geq |E[X_t | \mathcal{A}_s]|^p = |X_s|^p . $$

□


Example. Let $X_n$ be a sequence of IID exponentially distributed random
variables with probability density $f_X(x) = c e^{-cx}$. Let $S_n = \sum_{k=1}^n X_k$. The
Poisson process $N_t$ with time $T = \mathbb{R}^+ = [0,\infty)$ is defined as

$$ N_t = \sum_{k=1}^{\infty} 1_{\{S_k \leq t\}} . $$

It is an example of a process which is not continuous. This process
takes values in $\mathbb{N}$ and counts how many jumps occur up to time
t. Since $E[N_t] = ct$ and the increments are independent, it follows that $N_t - ct$ is a martingale with respect to
the filtration $\mathcal{A}_t = \sigma(N_s, s \leq t)$. It is a right continuous process. We know
therefore that it is progressively measurable and that for each stopping
time T, also $N^T$ is progressively measurable. See [49] or the last chapter
for more information about Poisson processes.


Figure. The Poisson point process on the line. $N_t$ is the number of events
which happen up to time t. It could model for example the number $N_t$ of
hits onto a website.

Proposition 4.6.3. (Interval theorem) The Poisson process has independent
increments

$$ N_t - N_s = \sum_{n=1}^{\infty} 1_{\{s < S_n \leq t\}} . $$

Moreover, $N_t$ is Poisson distributed with parameter tc:

$$ P[N_t = k] = \frac{(tc)^k}{k!} e^{-tc} . $$


Proof. The proof is done by starting with a Poisson distributed process $N_t$.
Define then

$$ S_n(\omega) = \{ t \mid N_t = n, \; N_{t-0} = n-1 \} $$

and show that $X_n = S_n - S_{n-1}$ are independent random variables with
exponential distribution. □
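
The definition of $N_t$ through exponential waiting times and the Poisson law of the proposition can be compared in a simulation. This is a minimal sketch; the rate c = 2, the time t = 1.5, the sample size and all function names are illustrative assumptions.

```python
import math
import random

def poisson_count(c, t, rng):
    """N_t: number of partial sums S_k = X_1 + ... + X_k with S_k <= t,
    where the X_k are IID exponential with density c * exp(-c x)."""
    count, s = 0, rng.expovariate(c)
    while s <= t:
        count += 1
        s += rng.expovariate(c)
    return count

def empirical_pmf(c, t, n_samples=20000, seed=0):
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_samples):
        k = poisson_count(c, t, rng)
        counts[k] = counts.get(k, 0) + 1
    return {k: v / n_samples for k, v in counts.items()}

def poisson_pmf(k, mean):
    """P[N_t = k] = mean^k / k! * exp(-mean) with mean = t c."""
    return mean ** k / math.factorial(k) * math.exp(-mean)

c, t = 2.0, 1.5
pmf = empirical_pmf(c, t)
mean_estimate = sum(k * p for k, p in pmf.items())
for k in range(9):
    print(k, round(pmf.get(k, 0.0), 4), round(poisson_pmf(k, c * t), 4))
```

The empirical frequencies should match the Poisson probabilities up to sampling error, and the empirical mean should be close to ct.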


Remark. Poisson processes on the lattice $\mathbb{Z}^d$ are also called Brownian
motion on the lattice and can be used to describe Feynman-Kac formulas for
discrete Schrödinger operators. The process is defined as follows: take $X_k$
and $S_k$ as above and define

$$ Y_t = \sum_{k=1}^{\infty} Z_k 1_{\{S_k \leq t\}} , $$

where $Z_n$ are IID random variables taking values in $\{m \in \mathbb{Z}^d \mid |m| = 1\}$.
This means that a particle stays at a lattice site for an exponential time
and jumps then to one of the neighbors of n with equal probability. Let
$P_n$ be the analog of the Wiener measure on right continuous paths on the
lattice and denote with $E_n$ the expectation. The Feynman-Kac formula for
discrete Schrödinger operators $H = H_0 + V$ is

$$ (e^{-itH} u)(n) = e^{2dt} E_n[ u(Y_t) \, i^{N_t} e^{-i \int_0^t V(Y_s) \, ds} ] . $$


4.7 Doob inequalities 


We have already established inequalities of Doob for discrete times T = N. 
By a limiting argument, they hold also for right-continuous submartingales. 


Theorem 4.7.1 (Doob’s submartingale inequality). Let X be a non-negative
right continuous submartingale with time T = [a,b]. For any $\epsilon > 0$

$$ \epsilon \cdot P[\sup_{a \leq t \leq b} X_t \geq \epsilon] \leq E[X_b ; \{ \sup_{a \leq t \leq b} X_t \geq \epsilon \}] \leq E[X_b] . $$

Proof. Take a countable dense subset D of T which contains b and choose an increasing sequence
$D_n$ of finite sets with $b \in D_n$ such that $\bigcup_n D_n = D$. We know now that for all n

$$ \epsilon \cdot P[\sup_{t \in D_n} X_t \geq \epsilon] \leq E[X_b ; \{ \sup_{t \in D_n} X_t \geq \epsilon \}] \leq E[X_b] $$

since $E[X_t]$ is nondecreasing in t. Going to the limit $n \to \infty$ gives the claim
with T = D. Since X is right continuous, we get the claim for T = [a,b]. □


One often applies this inequality to the non-negative submartingale |X| if 
X is a martingale. 


Theorem 4.7.2 (Doob’s $L^p$ inequality). Fix p > 1 and q satisfying $p^{-1} + q^{-1} = 1$.
Given a non-negative right-continuous submartingale X with
time T = [a,b] which is bounded in $\mathcal{L}^p$. Then $X^* = \sup_{t \in T} X_t$ is in $\mathcal{L}^p$ and
satisfies

$$ ||X^*||_p \leq q \cdot \sup_{t \in T} ||X_t||_p . $$


Proof. Take a countable dense subset D of T and choose an increasing sequence
$D_n$ of finite sets such that $\bigcup_n D_n = D$.
We had

$$ || \sup_{t \in D_n} X_t ||_p \leq q \cdot \sup_{t \in D_n} ||X_t||_p . $$

Going to the limit gives

$$ || \sup_{t \in D} X_t ||_p \leq q \cdot \sup_{t \in D} ||X_t||_p . $$

Since D is dense and X is right continuous we can replace D by T. □


The following inequality measures how big the probability is that one-dimensional
Brownian motion will leave the cone $\{(t,x), |x| \leq a \cdot t\}$.


Theorem 4.7.3 (Exponential inequality). $S_t = \sup_{0 \leq s \leq t} B_s$ satisfies for any
a > 0

$$ P[S_t \geq a \cdot t] \leq e^{-a^2 t/2} . $$


Proof. We have seen in proposition (4.6.1) that $M_t = e^{\alpha B_t - \alpha^2 t/2}$ is a martingale.
It is nonnegative. Since for $\alpha > 0$

$$ \exp(\alpha S_t - \frac{\alpha^2 t}{2}) \leq \exp(\sup_{s \leq t} \alpha B_s - \frac{\alpha^2 t}{2}) \leq \sup_{s \leq t} \exp(\alpha B_s - \frac{\alpha^2 s}{2}) = \sup_{s \leq t} M_s , $$

we get with Doob’s submartingale inequality (4.7.1)

$$ P[S_t \geq at] \leq P[\sup_{s \leq t} M_s \geq e^{\alpha a t - \alpha^2 t/2}] \leq \exp(-\alpha a t + \frac{\alpha^2 t}{2}) E[M_t] . $$

The result follows from $E[M_t] = E[M_0] = 1$ and $\inf_{\alpha > 0} \exp(-\alpha a t + \frac{\alpha^2 t}{2}) = \exp(-\frac{a^2 t}{2})$. □
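
Another corollary of Doob’s maximal inequality will also be useful.


Corollary 4.7.4. For $\alpha, \beta > 0$,

$$ P[\sup_{s \in [0,1]} (B_s - \frac{\alpha s}{2}) \geq \beta] \leq e^{-\alpha \beta} . $$


Proof. With the martingale $M_s = e^{\alpha B_s - \alpha^2 s/2}$ of proposition (4.6.1),

$$ P[\sup_{s \in [0,1]} (B_s - \frac{\alpha s}{2}) \geq \beta] = P[\sup_{s \in [0,1]} e^{\alpha B_s - \alpha^2 s/2} \geq e^{\alpha \beta}] = P[\sup_{s \in [0,1]} M_s \geq e^{\alpha \beta}] \leq e^{-\alpha \beta} \sup_{s \in [0,1]} E[M_s] = e^{-\alpha \beta} $$

since $E[M_s] = 1$ for all s. □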
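
The exponential inequality of Theorem 4.7.3 can be compared with the exact law of the running maximum. The sketch below relies on the reflection principle, $P[S_t \geq x] = 2 P[B_t \geq x]$, which is a standard fact about Brownian motion but is not derived in this section; the function names are illustrative.

```python
import math

def exact_tail(a, t):
    """P[S_t >= a t] = 2 P[B_t >= a t] = erfc(a * sqrt(t / 2)),
    using the reflection principle (assumed, not proven here)."""
    return math.erfc(a * math.sqrt(t / 2.0))

def exponential_bound(a, t):
    """Right hand side exp(-a^2 t / 2) of the exponential inequality."""
    return math.exp(-a * a * t / 2.0)

rows = [(a, t, exact_tail(a, t), exponential_bound(a, t))
        for a in (0.5, 1.0, 2.0) for t in (0.5, 1.0, 4.0)]
for a, t, exact, bound in rows:
    print(f"a={a} t={t} exact={exact:.5f} bound={bound:.5f}")
```

In every row the exact probability stays below the bound, and both decay at the same exponential rate in $a^2 t$.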


4.8 Khintchine’s law of the iterated logarithm 


Khintchine’s law of the iterated logarithm for Brownian motion gives a precise
statement about how one-dimensional Brownian motion oscillates in a
neighborhood of the origin. As in the law of the iterated logarithm, define

$$ \Lambda(t) = \sqrt{2 t \log|\log t|} . $$


Theorem 4.8.1 (Law of iterated logarithm for Brownian motion).

$$ P[\limsup_{t \to 0} \frac{B_t}{\Lambda(t)} = 1] = 1, \quad P[\liminf_{t \to 0} \frac{B_t}{\Lambda(t)} = -1] = 1 . $$
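
Proof. The second statement follows from the first by changing $B_t$ to $-B_t$.


(i) $\limsup_{s \to 0} \frac{B_s}{\Lambda(s)} \leq 1$ almost everywhere:
Take $\theta, \delta \in (0,1)$ and define

$$ \alpha_n = (1+\delta) \theta^{-n} \Lambda(\theta^n), \quad \beta_n = \frac{\Lambda(\theta^n)}{2} . $$

We have $\alpha_n \beta_n = (1+\delta) \log|\log(\theta^n)| = (1+\delta)(\log(n) + \log|\log(\theta)|)$. From corollary (4.7.4),
we get

$$ P[\sup_{s \leq 1} (B_s - \frac{\alpha_n s}{2}) \geq \beta_n] \leq e^{-\alpha_n \beta_n} = K n^{-(1+\delta)} . $$

The Borel-Cantelli lemma assures

$$ P[\liminf_{n \to \infty} \{ \sup_{s \leq 1} (B_s - \frac{\alpha_n s}{2}) < \beta_n \}] = 1 $$

which means that for almost every ω, there is $n_0(\omega)$ such that for $n > n_0(\omega)$
and $s \in [0, \theta^{n-1})$,

$$ B_s(\omega) \leq \frac{\alpha_n s}{2} + \beta_n \leq \frac{\alpha_n \theta^{n-1}}{2} + \beta_n = (\frac{1+\delta}{2\theta} + \frac{1}{2}) \Lambda(\theta^n) . $$

Since Λ is increasing on a sufficiently small interval [0,a), we have for
sufficiently large n and $s \in (\theta^n, \theta^{n-1}]$

$$ B_s(\omega) \leq (\frac{1+\delta}{2\theta} + \frac{1}{2}) \Lambda(s) . $$

In the limit $\theta \to 1$ and $\delta \to 0$, we get the claim.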


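(ii) $\limsup_{s \to 0} \frac{B_s}{\Lambda(s)} \geq 1$ almost everywhere.
For $\theta \in (0,1)$, the sets

$$ A_n = \{ B_{\theta^n} - B_{\theta^{n+1}} \geq (1 - \sqrt{\theta}) \Lambda(\theta^n) \} $$

are independent and since $B_{\theta^n} - B_{\theta^{n+1}}$ is Gaussian with variance $\theta^n - \theta^{n+1}$ we have

$$ P[A_n] = \int_a^{\infty} e^{-u^2/2} \frac{du}{\sqrt{2\pi}} > \frac{1}{\sqrt{2\pi}} \frac{a}{a^2+1} e^{-a^2/2} $$

with $a = (1 - \sqrt{\theta}) \Lambda(\theta^n) / \sqrt{\theta^n - \theta^{n+1}}$. Since $a^2/2 = \alpha \log(n |\log \theta|)$ with
$\alpha = (1-\sqrt{\theta})^2/(1-\theta) < 1$, the right hand side is bounded below by $K n^{-\alpha} / \sqrt{\log n}$
with some constant K.
Therefore $\sum_n P[A_n] = \infty$ and by the second Borel-Cantelli lemma,

$$ B_{\theta^n} \geq (1 - \sqrt{\theta}) \Lambda(\theta^n) + B_{\theta^{n+1}} \qquad (4.1) $$

for infinitely many n. Since −B is also Brownian motion, we know from (i)
that

$$ -B_{\theta^{n+1}} < 2 \Lambda(\theta^{n+1}) \qquad (4.2) $$

for sufficiently large n. Using these two inequalities (4.1) and (4.2) and
$\Lambda(\theta^{n+1}) < 2 \sqrt{\theta} \Lambda(\theta^n)$ for large enough n, we get

$$ B_{\theta^n} > (1 - \sqrt{\theta}) \Lambda(\theta^n) - 2 \Lambda(\theta^{n+1}) > \Lambda(\theta^n)(1 - \sqrt{\theta} - 4\sqrt{\theta}) $$

for infinitely many n and therefore

$$ \limsup_{t \to 0} \frac{B_t}{\Lambda(t)} \geq \limsup_{n \to \infty} \frac{B_{\theta^n}}{\Lambda(\theta^n)} \geq 1 - 5\sqrt{\theta} . $$

The claim follows for $\theta \to 0$. □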


Remark. This statement shows also that $B_t$ changes sign infinitely often
for $t \to 0$ and that Brownian motion is recurrent in one dimension. One
could show more, namely that the set $\{B_t = 0\}$ is a nonempty perfect set
with Hausdorff dimension 1/2 which is in particular uncountable.


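By time inversion, one gets the law of iterated logarithm near infinity:


Corollary 4.8.2.

$$ P[\limsup_{t \to \infty} \frac{B_t}{\Lambda(t)} = 1] = 1, \quad P[\liminf_{t \to \infty} \frac{B_t}{\Lambda(t)} = -1] = 1 . $$


Proof. Since $\tilde{B}_t = t B_{1/t}$ (with $\tilde{B}_0 = 0$) is a Brownian motion, we have with
s = 1/t

$$ 1 = \limsup_{s \to 0} \frac{\tilde{B}_s}{\Lambda(s)} = \limsup_{s \to 0} \frac{s B_{1/s}}{\Lambda(s)} = \limsup_{t \to \infty} \frac{B_t}{t \Lambda(1/t)} = \limsup_{t \to \infty} \frac{B_t}{\Lambda(t)} . $$

The other statement follows again by reflection. □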
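
The last step of this proof uses the identity $t \Lambda(1/t) = \Lambda(t)$, which holds because $|\log(1/t)| = |\log t|$. A short numerical check, with `Lambda` as an illustrative name for the function Λ and with sample points chosen where $\log|\log t| \geq 0$:

```python
import math

def Lambda(t):
    """Lambda(t) = sqrt(2 t log|log t|), defined where |log t| >= 1."""
    return math.sqrt(2.0 * t * math.log(abs(math.log(t))))

# Compare t * Lambda(1/t) with Lambda(t) for a few large times t.
checks = [(t, t * Lambda(1.0 / t), Lambda(t)) for t in (10.0, 100.0, 1e4, 1e8)]
for t, lhs, rhs in checks:
    print(t, lhs, rhs)
```

Both columns agree up to rounding, so time inversion maps the scale Λ near 0 to the same scale Λ near infinity.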


Corollary 4.8.3. For d-dimensional Brownian motion, one has

$$ P[\limsup_{t \to 0} \frac{|B_t|}{\Lambda(t)} = 1] = 1, \quad P[\liminf_{t \to 0} \frac{B_t \cdot e}{\Lambda(t)} = -1] = 1 $$

for every unit vector e.


Proof. Let e be a unit vector in $\mathbb{R}^d$. Then $B_t \cdot e$ is a 1-dimensional Brownian
motion since $B_t$ was defined as the product of d independent one-dimensional Brownian
motions. From the previous theorem, we have

$$ P[\limsup_{t \to 0} \frac{B_t \cdot e}{\Lambda(t)} = 1] = 1 . $$

Since $B_t \cdot e \leq |B_t|$, we know that the limsup of $|B_t|/\Lambda(t)$ is $\geq 1$. This is true for all
unit vectors and we can even get it simultaneously for a dense set $\{e_n\}_{n \in \mathbb{N}}$
of unit vectors in the unit sphere. Assume the limsup is $1 + \epsilon > 1$. Then,
there exists $e_n$ such that

$$ P[\limsup_{t \to 0} \frac{B_t \cdot e_n}{\Lambda(t)} \geq 1 + \frac{\epsilon}{2}] = 1 $$

in contradiction to the law of iterated logarithm for Brownian motion.
Therefore, we have lim sup = 1. By reflection symmetry, the liminf of $B_t \cdot e / \Lambda(t)$ is −1. □


Remark. It follows that in d dimensions, the set of limit points of $B_t/\Lambda(t)$
for $t \to 0$ is the entire unit ball $\{|v| \leq 1\}$.


4.9 The theorem of Dynkin-Hunt 


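Definition. Denote by I(k,n) the interval $[\frac{k-1}{2^n}, \frac{k}{2^n})$. If T is a stopping time,
then $T^{(n)}$ denotes its discretisation

$$ T^{(n)}(\omega) = \sum_{k=1}^{\infty} 1_{I(k,n)}(T(\omega)) \frac{k}{2^n} $$

which is again a stopping time. Define also:

$$ \mathcal{A}_{T+} = \{ A \in \mathcal{A}_{\infty} \mid A \cap \{T < t\} \in \mathcal{A}_t, \; \forall t \} . $$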


The next theorem tells that Brownian motion starts afresh at stopping 
times. 


Theorem 4.9.1 (Dynkin-Hunt). Let T be a stopping time for Brownian
motion, then $\tilde{B}_t = B_{t+T} - B_T$ is Brownian motion when conditioned to
$\{T < \infty\}$ and $\tilde{B}_t$ is independent of $\mathcal{A}_{T+}$ when conditioned to $\{T < \infty\}$.


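Proof. Let A be the set $\{T < \infty\}$. The theorem says that for every function

$$ f(B) = g(B_{t_1}, B_{t_2}, \dots, B_{t_n}) $$

with $g \in C(\mathbb{R}^n)$

$$ E[f(\tilde{B}) 1_A] = E[f(B)] \cdot P[A] $$

and that for every set $C \in \mathcal{A}_{T+}$

$$ E[f(\tilde{B}) 1_{A \cap C}] \cdot P[A] = E[f(\tilde{B}) 1_A] \cdot P[A \cap C] . $$

These two statements are equivalent to the statement that for every $C \in \mathcal{A}_{T+}$

$$ E[f(\tilde{B}) \cdot 1_{A \cap C}] = E[f(B)] \cdot P[A \cap C] . $$

Let $T^{(n)}$ be the discretisation of the stopping time T and $A_n = \{T^{(n)} < \infty\}$
as well as $A_{n,k} = \{T^{(n)} = k/2^n\}$. For a fixed time s, write $B^{(s)}_t = B_{t+s} - B_s$ for the
path restarted at s. It is a Brownian motion independent of $\mathcal{A}_s$, and $A_{n,k} \cap C \in \mathcal{A}_{k/2^n}$
because $C \in \mathcal{A}_{T+}$. Using $A = \{T < \infty\}$, $P[\bigcup_{k=1}^{\infty} A_{n,k} \cap C] \to
P[A \cap C]$ for $n \to \infty$, we compute

$$ E[f(\tilde{B}) 1_{A \cap C}] = \lim_{n \to \infty} E[f(B^{(T^{(n)})}) 1_{A_n \cap C}] $$
$$ = \lim_{n \to \infty} \sum_{k=0}^{\infty} E[f(B^{(k/2^n)}) 1_{A_{n,k} \cap C}] $$
$$ = \lim_{n \to \infty} \sum_{k=0}^{\infty} E[f(B)] \cdot P[A_{n,k} \cap C] $$
$$ = E[f(B)] \lim_{n \to \infty} P[\bigcup_{k=1}^{\infty} A_{n,k} \cap C] $$
$$ = E[f(B)] \cdot P[A \cap C] . $$

□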


Remark. If $T < \infty$ almost everywhere, no conditioning is necessary and
$B_{t+T} - B_T$ is again Brownian motion.


Theorem 4.9.2 (Blumenthal’s zero-one law). For every set $A \in \mathcal{A}_{0+}$ we have
P[A] = 0 or P[A] = 1.


Proof. Take the stopping time T which is identically 0. Now $\tilde{B} = B_{t+T} - B_T = B$.
By Dynkin-Hunt’s result, we know that $\tilde{B} = B$ is independent of
$\mathcal{A}_{T+} = \mathcal{A}_{0+}$. Since every $C \in \mathcal{A}_{0+}$ is $\{B_s, s > 0\}$ measurable, we know that
$\mathcal{A}_{0+}$ is independent of itself. Hence $P[A] = P[A \cap A] = P[A]^2$, which forces P[A] to be 0 or 1. □


Remark. This zero-one law can be used to define regular points on the
boundary of a domain $D \subset \mathbb{R}^d$. Given a point $y \in \delta D$. We say it is regular,
if $P_y[T_{\delta D} > 0] = 0$ and irregular if $P_y[T_{\delta D} > 0] = 1$. This definition turns
out to be equivalent to the classical definition in potential theory: a point
$y \in \delta D$ is regular if and only if there exists a barrier function $f : N \to \mathbb{R}$
in a neighborhood N of y. A barrier function is defined as a negative subharmonic
function on $\mathrm{int}(N \cap D)$ satisfying $f(x) \to 0$ for $x \to y$ within
D.


4.10 Self-intersection of Brownian motion 


Our aim is to prove the following theorem: 


Theorem 4.10.1 (Self intersections of Brownian motion). For $d \leq 3$, Brownian
motion has infinitely many self intersections with probability 1.


Remark. Kakutani, Dvoretzky and Erdős have shown that for d > 3, there
are no self-intersections with probability 1. It is known that for $d \leq 2$, there
are infinitely many n-fold points and for $d \geq 3$, there are no triple points.


Proposition 4.10.2. Let K be a compact subset of $\mathbb{R}^d$ and T the hitting time
of K with respect to Brownian motion starting at y. The hitting probability
$h(y) = P[y + B_s \in K, T \leq s < \infty]$ is a harmonic function on $\mathbb{R}^d \setminus K$.


Proof. Let $T_\delta$ be the hitting time of $S_\delta = \{|x| = \delta\}$ for $B_t$. By the law of
iterated logarithm, we have $T_\delta < \infty$ almost everywhere. By Dynkin-Hunt,
we know that $\tilde{B}_t = B_{t+T_\delta} - B_{T_\delta}$ is again Brownian motion.


If δ is small enough, then $y + B_t \notin K$ for $t \leq T_\delta$. The random variable
$B_{T_\delta} \in S_\delta$ has a uniform distribution on $S_\delta$ because Brownian motion is
rotationally symmetric. We have therefore

$$ h(y) = P[y + B_s \in K, s \geq T_\delta] = P[y + B_{T_\delta} + \tilde{B}_s \in K] = \int_{S_\delta} h(y + x) \; d\mu(x) , $$

where μ is the normalized Lebesgue measure on $S_\delta$. This equality for small
enough δ is the definition of harmonicity. □


Proposition 4.10.3. Let K be a countable union of closed balls. Then
$h(y,K) \to 1$ for $y \to K$.


Proof. (i) We show the claim first for one ball $K = B_r(z)$ and let $R = |z - y|$.
By Brownian scaling $B_t \sim c \cdot B_{t/c^2}$. The hitting probability of K can only
be a function f(r/R) of r/R:

$$ h(y,K) = P[y + B_s \in K \text{ for some } s \geq 0] = P[cy + c B_s \in cK \text{ for some } s \geq 0] $$
$$ = P[cy + \tilde{B}_{c^2 s} \in cK \text{ for some } s \geq 0] = h(cy, cK) , $$

where $\tilde{B}_t = c B_{t/c^2}$ is again a Brownian motion.

We have to show therefore that $f(x) \to 1$ as $x \to 1$. By translation invariance,
we can fix $y = y_0 = (1,0,\dots,0)$ and change $K_\alpha$, which is a ball of
radius α around $(-\alpha,0,\dots,0)$. We have

$$ h(y_0, K_\alpha) = f(\alpha/(1+\alpha)) $$

and take therefore the limit $\alpha \to \infty$

$$ \lim_{x \to 1} f(x) = \lim_{\alpha \to \infty} h(y_0, K_\alpha) = h(y_0, \bigcup_\alpha K_\alpha) = P[\inf_{s \geq 0} (B_s)_1 < -1] = 1 $$

because of the law of iterated logarithm. Here $(B_s)_1$ denotes the first coordinate of $B_s$.

(ii) Given $y_n \to y_0 \in K$. Then $y_0 \in K_0$ for some ball $K_0$.

$$ \liminf_{n \to \infty} h(y_n, K) \geq \lim_{n \to \infty} h(y_n, K_0) = 1 $$

by (i). □


Definition. Let μ be a probability measure on $\mathbb{R}^3$. Define the potential
theoretical energy of μ as

$$ I(\mu) = \int_{\mathbb{R}^3} \int_{\mathbb{R}^3} |x-y|^{-1} \; d\mu(x) \, d\mu(y) . $$

Given a compact set $K \subset \mathbb{R}^3$, the capacity of K is defined as

$$ C(K) = ( \inf_{\mu \in M(K)} I(\mu) )^{-1} , $$

where M(K) is the set of probability measures on K. A measure on K
minimizing the energy is called an equilibrium measure.


Remark. These definitions can be done in any dimension. In the case d =
2, one replaces $|x-y|^{-1}$ by $\log|x-y|^{-1}$. In the case $d \geq 3$, one takes
$|x-y|^{-(d-2)}$. The capacity is for d = 2 defined as $\exp(-\inf_\mu I(\mu))$ and for
$d \geq 3$ as $(\inf_\mu I(\mu))^{-(d-2)}$.


Definition. We say a measure $\mu_n$ on $\mathbb{R}^d$ converges weakly to μ, if for all continuous
functions f, $\int f \, d\mu_n \to \int f \, d\mu$. The set of all probability measures
on a compact subset E of $\mathbb{R}^d$ is known to be compact.


The next proposition is part of Frostman’s fundamental theorem of poten- 
tial theory. For detailed proofs, we refer to [39, 80]. 


Proposition 4.10.4. For every compact set $K \subset \mathbb{R}^d$, there exists an equilibrium
measure μ on K and the equilibrium potential $\int |x-y|^{-(d-2)} \, d\mu(y)$
resp. $\int \log(|x-y|^{-1}) \, d\mu(y)$ takes the value $C(K)^{-1}$ on the support $K^*$ of
μ.

Proof. (i) (Lower semicontinuity of energy) If $\mu_n$ converges to μ, then

$$ \liminf_{n \to \infty} I(\mu_n) \geq I(\mu) . $$

(ii) (Existence of equilibrium measure) The existence of an equilibrium measure
μ follows from the compactness of the set of probability measures on
K and the lower semicontinuity of the energy since a lower semi-continuous
function takes a minimum on a compact space. Take a sequence $\mu_n$ such
that

$$ I(\mu_n) \to \inf_{\mu \in M(K)} I(\mu) . $$

Then $\mu_n$ has an accumulation point μ and $I(\mu) \leq \inf_{\mu \in M(K)} I(\mu)$.

(iii) (Value of capacity) If the potential $\phi(x)$ belonging to μ is constant on
K, then it must take the value $C(K)^{-1}$ since

$$ \int \phi(x) \; d\mu(x) = I(\mu) . $$

(iv) (Constancy of capacity) Assume the potential is not constant $C(K)^{-1}$
on $K^*$. By constructing a new measure on $K^*$ one shows then that one can
strictly decrease the energy. This is physically evident if we think of φ as
the potential of a charge distribution μ on the set K. □


Corollary 4.10.5. Let μ be the equilibrium distribution on K. Then

$$ h(y,K) = \phi_\mu(y) \cdot C(K) $$

and therefore $h(y,K) \geq C(K) \cdot \inf_{x \in K} |x-y|^{-1}$.


Proof. Assume first K is a countable union of balls. According to proposition
(4.10.2) and proposition (4.10.3), both functions h and $\phi_\mu \cdot C(K)$ are
harmonic, zero at ∞ and equal to 1 on $\delta(K)$. They must therefore be equal.
For a general compact set K, let $\{y_n\}$ be a dense set in K and let $K_\epsilon =
\bigcup_n B_\epsilon(y_n)$. One can pass to the limit $\epsilon \to 0$. Both $h(y,K_\epsilon) \to h(y,K)$ and
$\inf_{x \in K_\epsilon} |x-y|^{-1} \to \inf_{x \in K} |x-y|^{-1}$ are clear. The statement $C(K_\epsilon) \to
C(K)$ follows from the upper semicontinuity of the capacity: if $G_n$ is a sequence
of open sets with $\bigcap G_n = E$, then $C(G_n) \to C(E)$.

The upper semicontinuity of the capacity follows from the lower semicontinuity
of the energy. □


Proposition 4.10.6. Assume the dimension d = 3. For any interval J =
[a,b], the set

$$ B_J(\omega) = \{ B_t(\omega) \mid t \in [a,b] \} $$

has positive capacity for almost all ω.


Proof. We have to find a probability measure μ(ω) on $B_J(\omega)$ such that its
energy I(μ(ω)) is finite almost everywhere. Define such a measure by

$$ \mu(A) = \frac{|\{ s \in [a,b] \mid B_s \in A \}|}{b-a} . $$

Then

$$ I(\mu) = \int \int |x-y|^{-1} \; d\mu(x) \, d\mu(y) = \int_a^b \int_a^b (b-a)^{-2} |B_s - B_t|^{-1} \; ds \, dt . $$

To see the claim we have to show that this is finite almost everywhere. We
integrate over Ω which is by Fubini

$$ E[I(\mu)] = \int_a^b \int_a^b (b-a)^{-2} E[|B_s - B_t|^{-1}] \; ds \, dt $$

which is finite since $B_s - B_t$ has the same distribution as $\sqrt{|s-t|} B_1$ by
Brownian scaling and since $E[|B_1|^{-1}] = (2\pi)^{-d/2} \int |x|^{-1} e^{-|x|^2/2} \, dx < \infty$ in dimension
$d \geq 2$ and $\int_a^b \int_a^b |s-t|^{-1/2} \, ds \, dt < \infty$. □


Now we prove the theorem (4.10.1).


Proof. We have only to show the claim in the case d = 3. Because Brownian
motion projected to the plane is two dimensional Brownian motion and to the line
is one dimensional Brownian motion, the result in smaller dimensions follows.


(i) $\alpha = P[\bigcup_{t \in [0,1], s \geq 2} \{B_t = B_s\}] > 0$.

Proof. Let K be the set $\bigcup_{t \in [0,1]} B_t$. We know that it has positive capacity
almost everywhere and that therefore $h(B_2,K) > 0$ almost everywhere.
But $E[h(B_2,K)] = \alpha$ since $B_{s+2} - B_2$ is Brownian motion independent of
$B_s$, $0 \leq s \leq 1$.


(ii) $\alpha_T = P[\bigcup_{t \in [0,1], 2 \leq s \leq T} \{B_t = B_s\}] > 0$ for some T > 0. Proof. Clear since
$\alpha_T \to \alpha$ for $T \to \infty$.

(iii) Proof of the claim. Define the random variables $X_n = 1_{C_n}$ with

$$ C_n = \{ \omega \mid B_t = B_s \text{ for some } t \in [nT, nT+1], \; s \in [nT+2, (n+1)T] \} . $$

They are independent and by the strong law of large numbers $\sum_n X_n = \infty$
almost everywhere. □


Corollary 4.10.7. Any point $B_s(\omega)$ is an accumulation point of self-crossings
of $\{B_t(\omega)\}_{t \geq 0}$.


Proof. Again, we have only to treat the three dimensional case. Let T > 0
be such that

$$ \alpha_T = P[\bigcup_{t \in [0,1], 2 \leq s \leq T} \{B_t = B_s\}] > 0 $$

as in the proof of the theorem. By scaling,

$$ P[B_t = B_s \text{ for some } t \in [0,\beta], \; s \in [2\beta, T\beta]] $$

is independent of β. We have thus self-intersections of the Brownian path in
any interval [0,b] and by translation in any interval [a,b]. □


4.11 Recurrence of Brownian motion 


We show in this section that like its discrete brother, the random walk,
Brownian motion is transient in dimensions $d \geq 3$ and recurrent in dimensions
$d \leq 2$.


Lemma 4.11.1. Let T be a finite stopping time and $R_T(\omega)$ be a rotation in
$\mathbb{R}^d$ which turns $B_T(\omega)$ onto the first coordinate axis

$$ R_T(\omega) B_T(\omega) = (|B_T(\omega)|, 0, \dots, 0) . $$

Then $\tilde{B}_t = R_T(B_{t+T} - B_T)$ is again Brownian motion.


Proof. By the Dynkin-Hunt theorem, $\tilde{B}_t = B_{t+T} - B_T$ is Brownian motion
and independent of $\mathcal{A}_T$. By checking the definitions of Brownian motion,
it follows that if B is Brownian motion, also $R(x) B_t$ is Brownian motion,
if R(x) is a random rotation on $\mathbb{R}^d$ independent of $B_t$. Since $R_T$ is $\mathcal{A}_T$
measurable and $\tilde{B}_t$ is independent of $\mathcal{A}_T$, the claim follows. □


Lemma 4.11.2. Let $K_r$ be the ball of radius r centered at $0 \in \mathbb{R}^d$ with
$d \geq 3$. We have for $y \notin K_r$

$$ h(y, K_r) = (r/|y|)^{d-2} . $$

Proof. Both $h(y, K_r)$ and $(r/|y|)^{d-2}$ are harmonic functions which are 1 at
$\delta K_r$ and zero at infinity. They are the same. □
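
The harmonicity claimed for $(r/|y|)^{d-2}$ can be tested numerically. The sketch below approximates the Laplacian with central finite differences; the helper names, the step size and the test points are illustrative assumptions. For comparison, it also evaluates $\log|y|$ in dimension 2 (harmonic) and $|y|^{-2}$ in dimension 3, whose Laplacian is $2|y|^{-4}$ and therefore not zero.

```python
import math

def laplacian(f, y, eps=1e-3):
    """Central finite-difference approximation of the Laplacian of f at y."""
    total = 0.0
    for i in range(len(y)):
        up = list(y)
        dn = list(y)
        up[i] += eps
        dn[i] -= eps
        total += f(up) - 2.0 * f(y) + f(dn)
    return total / eps ** 2

def norm(y):
    return math.sqrt(sum(c * c for c in y))

def hitting_prob(r, d):
    """y -> (r / |y|)^(d - 2), the candidate for h(y, K_r) when d >= 3."""
    return lambda y: (r / norm(y)) ** (d - 2)

lap3 = laplacian(hitting_prob(1.0, 3), [1.5, 0.5, -1.0])
lap4 = laplacian(hitting_prob(1.0, 4), [1.5, 0.5, -1.0, 0.7])
lap_log = laplacian(lambda y: math.log(norm(y)), [1.5, 0.5])
lap_bad = laplacian(lambda y: norm(y) ** -2, [1.5, 0.5, -1.0])
print(lap3, lap4, lap_log, lap_bad)
```

The first three values vanish up to discretisation error, while the last one does not, which illustrates that the exponent d − 2 is the one singled out by harmonicity.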


Theorem 4.11.3 (Escape of Brownian motion in three dimensions). For
$d \geq 3$, we have $\lim_{t \to \infty} |B_t| = \infty$ almost surely.


Proof. Define a sequence of stopping times $T_n$ by

$$ T_n = \inf\{ s > 0 \mid |B_s| = 2^n \} , $$

which is finite almost everywhere because of the law of iterated logarithm.
We know from the lemma (4.11.1) that

$$ \tilde{B}_t = R_{T_n}(B_{t+T_n} - B_{T_n}) $$

is a copy of Brownian motion. Clearly also $|B_{T_n}| = 2^n$.

We have $B_s \in K_r(0) = \{|x| < r\}$ for some $s > T_n$ if and only if
$\tilde{B}_t \in -(2^n,0,\dots,0) + K_r(0)$ for some t > 0.

Therefore using the previous lemma

$$ P[B_s \in K_r(0); s > T_n] = P[\tilde{B}_t \in -(2^n,0,\dots,0) + K_r(0); t > 0] = (\frac{r}{2^n})^{d-2} $$

which implies in the case $r 2^{-n} < 1$ by the Borel-Cantelli lemma that for
almost all ω, $|B_s(\omega)| \geq r$ for $s > T_n$ if n is large enough. Since $T_n$ is finite almost everywhere,
we get $\liminf_s |B_s| \geq r$. Since r is arbitrary, the claim follows. □


Brownian motion is recurrent in dimensions $d \leq 2$. In the case d = 1, this
follows readily from the law of iterated logarithm. First a lemma:


Lemma 4.11.4. In dimension d = 2, almost every path of Brownian motion
hits a ball $K_r$ if r > 0: one has $h(y, K_r) = 1$.


Proof. We know that $h(y) = h(y, K_r)$ is harmonic and equal to 1 on $\delta K_r$. It
is also rotationally invariant and therefore $h(y) = a + b \log|y|$. Since $h \in [0,1]$
is bounded, we have b = 0, so h(y) = a and so a = 1. □


Theorem 4.11.5 (Recurrence of Brownian motion in 1 or 2 dimensions). Let
$d \leq 2$ and S be an open nonempty set in $\mathbb{R}^d$. Then the Lebesgue measure
of $\{t \mid B_t \in S\}$ is infinite.




Proof. It suffices to take $S = K_r(x_0)$, a ball of radius r around $x_0$. Since
by the previous lemma, Brownian motion hits every ball almost surely, we
can assume that $x_0 = 0$ and by scaling that r = 1.

Define inductively a sequence of hitting or leaving times $T_n$, $S_n$ of the
annulus $\{1/2 < |x| < 2\}$, where $T_1 = \inf\{t \mid |B_t| = 2\}$ and

$$ S_n = \inf\{ t > T_n \mid |B_t| = 1/2 \} , $$
$$ T_n = \inf\{ t > S_{n-1} \mid |B_t| = 2 \} . $$

These are finite stopping times. The Dynkin-Hunt theorem shows that $S_n - T_n$
and $T_n - S_{n-1}$ are two mutually independent families of IID random
variables. The Lebesgue measures $Y_n = |I_n|$ of the time intervals

$$ I_n = \{ t \mid |B_t| \leq 1, \; T_n \leq t \leq T_{n+1} \} , $$

are independent random variables. Therefore, also $X_n = \min(1, Y_n)$ are
independent bounded IID random variables. By the law of large numbers,
$\sum_n X_n = \infty$ which implies $\sum_n Y_n = \infty$ and the claim follows from

$$ |\{ t \in [0,\infty) \mid |B_t| \leq 1 \}| \geq \sum_n Y_n . $$

□


Remark. Brownian motion in $\mathbb{R}^d$ can be defined as a diffusion on $\mathbb{R}^d$ with
generator Δ/2, where Δ is the Laplacian on $\mathbb{R}^d$. A generalization of Brownian
motion to manifolds can be done using the diffusion processes with
respect to the Laplace-Beltrami operator. Like this, one can define Brownian
motion on the torus or on the sphere for example. See [57].


4.12 Feynman-Kac formula 


In quantum mechanics, the Schrödinger equation $i \hbar \dot{u} = H u$ defines the
evolution of the wave function $u(t) = e^{-itH/\hbar} u(0)$ in a Hilbert space $\mathcal{H}$. The
operator H is the Hamiltonian of the system. We assume it is a Schrödinger
operator $H = H_0 + V$, where $H_0 = -\Delta/2$ is the Hamiltonian of a free
particle and $V : \mathbb{R}^d \to \mathbb{R}$ is the potential. The free operator $H_0$ already is
not defined on the whole Hilbert space $\mathcal{H} = L^2(\mathbb{R}^d)$ and one restricts H to
a vector space D(H) called domain containing the dense set $C_0^\infty(\mathbb{R}^d)$
of all smooth functions which are zero at infinity. Define

$$ D(A^*) = \{ u \in \mathcal{H} \mid v \mapsto (Av, u) \text{ is a bounded linear functional on } D(A) \} . $$

If $u \in D(A^*)$, then there exists a unique function $w = A^* u \in \mathcal{H}$ such that
(Av, u) = (v, w) for all $v \in D(A)$. This defines the adjoint $A^*$ of A with
domain $D(A^*)$.


Definition. A linear operator $A : D(A) \subset \mathcal{H} \to \mathcal{H}$ is called symmetric if
(Au, v) = (u, Av) for all $u, v \in D(A)$ and self-adjoint, if it is symmetric and
$D(A) = D(A^*)$.

Definition. A sequence of bounded linear operators $A_n$ converges strongly
to A, if $A_n u \to A u$ for all $u \in \mathcal{H}$. One writes $A = s\text{-}\lim_{n \to \infty} A_n$.


Define $e^A = 1 + A + A^2/2! + A^3/3! + \cdots$. We will use the fact that a
self-adjoint operator defines a one parameter family of unitary operators
$t \mapsto e^{itA}$ which is strongly continuous. Moreover, $e^{itA}$ leaves the domain
D(A) of A invariant. For more details, see [81, 7].


Theorem 4.12.1 (Trotter product formula). Given self-adjoint operators
A, B defined on $D(A), D(B) \subset \mathcal{H}$. Assume A + B is self-adjoint on D =
$D(A) \cap D(B)$, then

$$ e^{it(A+B)} = s\text{-}\lim_{n \to \infty} (e^{itA/n} e^{itB/n})^n . $$

If A, B are bounded from below, then

$$ e^{-t(A+B)} = s\text{-}\lim_{n \to \infty} (e^{-tA/n} e^{-tB/n})^n . $$


Proof. Define

$$ S_t = e^{it(A+B)}, \quad V_t = e^{itA}, \quad W_t = e^{itB}, \quad U_t = V_t W_t $$

and $v_t = S_t v$ for $v \in D$. Because A + B is self-adjoint on D, one has $v_t \in D$.
Use a telescopic sum to estimate

$$ ||(S_t - U_{t/n}^n) v|| = || \sum_{j=0}^{n-1} U_{t/n}^j (S_{t/n} - U_{t/n}) S_{t/n}^{n-j-1} v || \leq n \sup_{0 \leq s \leq t} ||(S_{t/n} - U_{t/n}) v_s|| . $$

We have to show that this goes to zero for $n \to \infty$. Given $u \in D =
D(A) \cap D(B)$,

$$ \lim_{s \to 0} \frac{S_s - 1}{s} u = i(A+B)u = \lim_{s \to 0} \frac{U_s - 1}{s} u $$

so that for each $u \in D$

$$ \lim_{n \to \infty} n \cdot ||(S_{t/n} - U_{t/n}) u|| = 0 . \qquad (4.3) $$

The linear space D with norm $|||u||| = ||(A+B)u|| + ||u||$ is a Banach
space since A + B is self-adjoint on D and therefore closed. We have a
bounded family $\{n(S_{t/n} - U_{t/n})\}_{n \in \mathbb{N}}$ of bounded operators from D to $\mathcal{H}$.
The principle of uniform boundedness states that

$$ ||n(S_{t/n} - U_{t/n}) u|| \leq C \cdot |||u||| . $$

An ε/3 argument shows that the limit (4.3) exists uniformly on compact
subsets of D and especially on $\{v_s\}_{s \in [0,t]} \subset D$ and so $n \sup_{0 \leq s \leq t} ||(S_{t/n} -
U_{t/n}) v_s|| \to 0$. The second statement is proved in exactly the same way. □


Remark. Trotter’s product formula generalizes the Lie product formula

$$ \lim_{n \to \infty} (\exp(\frac{A}{n}) \exp(\frac{B}{n}))^n = \exp(A + B) $$

for finite dimensional matrices A, B, which is a special case.
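
The Lie product formula is easy to observe numerically for small matrices. The following sketch uses plain Python lists and a truncated power series for the matrix exponential; the helper names and the choice of the two non-commuting 2×2 matrices are illustrative assumptions.

```python
import math

def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def scale(X, c):
    return [[c * v for v in row] for row in X]

def add(X, Y):
    return [[a + b for a, b in zip(r, s)] for r, s in zip(X, Y)]

def identity(n):
    return [[float(i == j) for j in range(n)] for i in range(n)]

def expm(X, terms=30):
    """Matrix exponential via the truncated series 1 + X + X^2/2! + ..."""
    result = identity(len(X))
    term = identity(len(X))
    for m in range(1, terms):
        term = scale(matmul(term, X), 1.0 / m)
        result = add(result, term)
    return result

def lie_product(A, B, n):
    """(exp(A/n) exp(B/n))^n computed by repeated multiplication."""
    step = matmul(expm(scale(A, 1.0 / n)), expm(scale(B, 1.0 / n)))
    result = identity(len(A))
    for _ in range(n):
        result = matmul(result, step)
    return result

def max_diff(X, Y):
    return max(abs(a - b) for r, s in zip(X, Y) for a, b in zip(r, s))

A = [[0.0, 1.0], [0.0, 0.0]]
B = [[0.0, 0.0], [1.0, 0.0]]   # A and B do not commute
target = expm(add(A, B))       # equals [[cosh 1, sinh 1], [sinh 1, cosh 1]]
errors = [max_diff(lie_product(A, B, n), target) for n in (1, 10, 100, 1000)]
print(errors)
```

The error shrinks roughly like 1/n, which is the rate one expects when A and B do not commute.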
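

Corollary 4.12.2. (Feynman 1948) Assume $H = H_0 + V$ is self-adjoint on
D(H). Then

$$ e^{-itH} u(x_0) = \lim_{n \to \infty} (\frac{2 \pi i t}{n})^{-dn/2} \int_{(\mathbb{R}^d)^n} e^{i S_n(x_0,x_1,\dots,x_n,t)} u(x_n) \; dx_1 \dots dx_n $$

where

$$ S_n(x_0,x_1,\dots,x_n,t) = \frac{t}{n} \sum_{i=1}^{n} [ \frac{1}{2} (\frac{|x_i - x_{i-1}|}{t/n})^2 - V(x_i) ] . $$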
i=1 


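Proof. (Nelson) From $\dot{u} = -i H_0 u$, we get by Fourier transform $\dot{\hat{u}} = -i \frac{|k|^2}{2} \hat{u}$
which gives $\hat{u}_t(k) = \exp(-i \frac{|k|^2}{2} t) \hat{u}_0(k)$ and by inverse Fourier transform

$$ e^{-itH_0} u(x) = u_t(x) = (2 \pi i t)^{-d/2} \int_{\mathbb{R}^d} e^{i \frac{|x-y|^2}{2t}} u(y) \; dy . $$

The Trotter product formula

$$ e^{-it(H_0+V)} = s\text{-}\lim_{n \to \infty} (e^{-itH_0/n} e^{-itV/n})^n $$

gives now the claim. □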


Remark. We did not specify the set of potentials, for which $H_0 + V$ can be
made self-adjoint. For example, $V \in C_0^\infty(\mathbb{R}^d)$ is enough or $V \in L^2(\mathbb{R}^3) \cap
L^\infty(\mathbb{R}^3)$ in three dimensions.


We have seen in the above proof that $e^{-itH_0}$ has the integral kernel $\tilde{P}_t(x,y) =
(2 \pi i t)^{-d/2} e^{i \frac{|x-y|^2}{2t}}$. The same Fourier calculation shows that $e^{-tH_0}$ has the
integral kernel

$$ P_t(x,y) = (2 \pi t)^{-d/2} e^{-\frac{|x-y|^2}{2t}} = g_t(x-y) , $$

where $g_t$ is the density of a Gaussian random variable with variance t.


Note that even if $u \in L^2(\mathbb{R}^d)$ is only defined almost everywhere, the function
$u_t(x) = e^{-tH_0} u(x) = \int P_t(x,y) u(y) \, dy$ is continuous and defined
everywhere.


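Lemma 4.12.3. Given $f_1, \dots, f_n \in L^\infty(\mathbb{R}^d) \cap L^2(\mathbb{R}^d)$ and $0 < s_1 < \cdots < s_n$. Then

$$ (e^{-t_1 H_0} f_1 \cdots e^{-t_n H_0} f_n)(0) = \int f_1(B_{s_1}) \cdots f_n(B_{s_n}) \, dB , $$

where $t_1 = s_1$, $t_i = s_i - s_{i-1}$, $i \ge 2$, and the $f_i$ on the left hand side are understood as multiplication operators on $L^2(\mathbb{R}^d)$.


Proof. Since $B_{s_1}, B_{s_2} - B_{s_1}, \dots, B_{s_n} - B_{s_{n-1}}$ are mutually independent Gaussian random variables of variance $t_1, t_2, \dots, t_n$, their joint distribution is

$$ P_{t_1}(0, y_1) P_{t_2}(0, y_2) \cdots P_{t_n}(0, y_n) \, dy $$

which is after a change of variables $y_1 = x_1$, $y_i = x_i - x_{i-1}$

$$ P_{t_1}(0, x_1) P_{t_2}(x_1, x_2) \cdots P_{t_n}(x_{n-1}, x_n) \, dx . $$

Therefore,

$$ \int f_1(B_{s_1}) \cdots f_n(B_{s_n}) \, dB = \int P_{t_1}(0, y_1) \cdots P_{t_n}(0, y_n) f_1(y_1) f_2(y_1 + y_2) \cdots f_n(y_1 + \cdots + y_n) \, dy $$

$$ = \int P_{t_1}(0, x_1) P_{t_2}(x_1, x_2) \cdots P_{t_n}(x_{n-1}, x_n) f_1(x_1) \cdots f_n(x_n) \, dx $$

$$ = (e^{-t_1 H_0} f_1 \cdots e^{-t_n H_0} f_n)(0) . $$

□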


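Denote by $dB$ the Wiener measure on $C([0,\infty), \mathbb{R}^d)$ and with $dx$ the Lebesgue measure on $\mathbb{R}^d$. We define also an extended Wiener measure $dW = dx \times dB$ on $C([0,\infty), \mathbb{R}^d)$ on all paths $s \mapsto W_s = x + B_s$ starting at $x \in \mathbb{R}^d$.


Corollary 4.12.4. Given $f_0, f_1, \dots, f_n \in L^\infty(\mathbb{R}^d) \cap L^2(\mathbb{R}^d)$ and $0 \le s_0 < s_1 < \cdots < s_n$. Then, with $t_i = s_i - s_{i-1}$,

$$ \int f_0(W_{s_0}) \cdots f_n(W_{s_n}) \, dW = (f_0, e^{-t_1 H_0} f_1 \cdots e^{-t_n H_0} f_n) . $$

Proof. (i) Case $s_0 = 0$. From the above lemma, we have after the $dB$ integration that

$$ \int f_0(W_{s_0}) \cdots f_n(W_{s_n}) \, dW = \int f_0(x) \, (e^{-t_1 H_0} f_1 \cdots e^{-t_n H_0} f_n)(x) \, dx = (f_0, e^{-t_1 H_0} f_1 \cdots e^{-t_n H_0} f_n) . $$

(ii) In the case $s_0 > 0$ we have from (i) and the dominated convergence theorem

$$ \int f_0(W_{s_0}) \cdots f_n(W_{s_n}) \, dW = \lim_{R \to \infty} \int 1_{\{|x| < R\}}(W_0) f_0(W_{s_0}) \cdots f_n(W_{s_n}) \, dW $$

$$ = \lim_{R \to \infty} (f_0 \, e^{-s_0 H_0} 1_{\{|x| < R\}}, e^{-t_1 H_0} f_1 \cdots e^{-t_n H_0} f_n) = (f_0, e^{-t_1 H_0} f_1 \cdots e^{-t_n H_0} f_n) . $$

□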


We prove now the Feynman-Kac formula for Schrödinger operators of the form $H = H_0 + V$ with $V \in C_0^\infty(\mathbb{R}^d)$. Because $V$ is continuous, the integral $\int_0^t V(W_s(\omega)) \, ds$ can be taken for each $\omega$ as a limit of Riemann sums and $\int_0^t V(W_s) \, ds$ certainly is a random variable.


Theorem 4.12.5 (Feynman-Kac formula). Given $H = H_0 + V$ with $V \in C_0^\infty(\mathbb{R}^d)$, then

$$ (f, e^{-tH} g) = \int \overline{f(W_0)} \, g(W_t) \, e^{-\int_0^t V(W_s) \, ds} \, dW . $$


Proof. (Nelson) By the Trotter product formula

$$ (f, e^{-tH} g) = \lim_{n \to \infty} (f, (e^{-tH_0/n} e^{-tV/n})^n g) $$

so that by corollary (4.12.4)

$$ (f, e^{-tH} g) = \lim_{n \to \infty} \int \overline{f(W_0)} \, g(W_t) \exp\Big( -\frac{t}{n} \sum_{j=0}^{n-1} V(W_{tj/n}) \Big) \, dW \qquad (4.4) $$

and since $s \mapsto W_s$ is continuous, we have almost everywhere

$$ \frac{t}{n} \sum_{j=0}^{n-1} V(W_{tj/n}) \to \int_0^t V(W_s) \, ds . $$

The integrand on the right hand side of (4.4) is dominated by

$$ |f(W_0)| \cdot |g(W_t)| \cdot e^{t \|V\|_\infty} $$

which is in $L^1(dW)$ because again by corollary (4.12.4),

$$ \int |f(W_0)| \cdot |g(W_t)| \, dW = (|f|, e^{-tH_0} |g|) < \infty . $$

The dominated convergence theorem leads us now to the claim. □


Remark. The formula can be extended to larger classes of potentials like potentials $V$ which are locally in $L^1$. The self-adjointness, which is needed in Trotter's product formula, is assured if $V \in L^2 \cap L^p$ with $p > d/2$. Also Trotter's product formula allows further generalizations [93, 31].


Why is the Feynman-Kac formula useful? 


• One can use Brownian motion to study Schrödinger semigroups. It allows for example to give an easy proof of the ArcSin-law for Brownian motion.


• One can treat operators with magnetic fields in a unified way.


• Functional integration is a way of quantization which generalizes to more situations.


• It is useful to study ground states and ground state energies under perturbations.


• One can study the classical limit $\hbar \to 0$.


4.13 The quantum mechanical oscillator 


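The one-dimensional Schrödinger operator

$$ H = H_0 + U = -\frac{1}{2} \frac{d^2}{dx^2} + \frac{1}{2} x^2 - \frac{1}{2} $$

is the Hamiltonian of the quantum mechanical oscillator. It is a quantum mechanical system which can be solved explicitly like its classical analog, which has the Hamiltonian $H(x,p) = \frac{1}{2} p^2 + \frac{1}{2} x^2 - \frac{1}{2}$.


One can write

$$ H = A A^* - 1 = A^* A , $$

with

$$ A^* = \frac{1}{\sqrt{2}} \Big( x - \frac{d}{dx} \Big), \qquad A = \frac{1}{\sqrt{2}} \Big( x + \frac{d}{dx} \Big) . $$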


The first order operator $A^*$ is also called particle creation operator and $A$ the particle annihilation operator. The space $C_0^\infty$ of smooth functions of compact support is dense in $L^2(\mathbb{R})$. Because for all $u, v \in C_0^\infty(\mathbb{R})$

$$ (Au, v) = (u, A^* v) $$

the two operators are adjoint to each other. The vector

$$ \Omega_0(x) = \pi^{-1/4} e^{-x^2/2} $$

is a unit vector because $\Omega_0^2$ is the density of a $N(0, 1/\sqrt{2})$ distributed random variable. Because $A \Omega_0 = 0$, it is an eigenvector of $H = A^* A$ with eigenvalue $0$. It is called the ground state or vacuum state describing the system with no particle. Define inductively the $n$-particle states

$$ \Omega_n = \frac{1}{\sqrt{n}} A^* \Omega_{n-1} $$

by creating an additional particle from the $(n-1)$-particle state $\Omega_{n-1}$.


Figure. The first Hermite functions $\Omega_n$. They are unit vectors in $L^2(\mathbb{R})$ defined by

$$ \Omega_n(x) = \frac{H_n(x) \Omega_0(x)}{\sqrt{2^n n!}} , $$

where $H_n(x)$ are Hermite polynomials, $H_0(x) = 1$, $H_1(x) = 2x$, $H_2(x) = 4x^2 - 2$, $H_3(x) = 8x^3 - 12x$, ....


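Theorem 4.13.1 (Quantum mechanical oscillator). The following properties hold:

a) The functions are orthonormal: $(\Omega_n, \Omega_m) = \delta_{n,m}$.

b) $A \Omega_n = \sqrt{n} \, \Omega_{n-1}$, $A^* \Omega_n = \sqrt{n+1} \, \Omega_{n+1}$.

c) The numbers $n = 0, 1, 2, \dots$ are the eigenvalues of $H = A^* A$:

$$ H \Omega_n = A^* A \, \Omega_n = n \, \Omega_n . $$

d) The functions $\Omega_n$ form a basis in $L^2(\mathbb{R})$.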
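Part a) can be checked with exact rational arithmetic. The sketch below is an illustration, not part of the original notes: it assumes the standard recurrence $H_{k+1}(x) = 2x H_k(x) - 2k H_{k-1}(x)$ for the Hermite polynomials and the Gaussian moments $\int x^{2k} e^{-x^2} dx = \sqrt{\pi} \, (2k-1)!!/2^k$, and the function names are hypothetical. Since $\Omega_n = H_n \Omega_0/\sqrt{2^n n!}$ with $\Omega_0^2 = \pi^{-1/2} e^{-x^2}$, orthonormality is equivalent to $\pi^{-1/2} \int H_n H_m e^{-x^2} dx = 2^n n! \, \delta_{n,m}$.

```python
from fractions import Fraction
from math import factorial


def hermite(n):
    """Coefficients (lowest degree first) of the Hermite polynomial H_n."""
    h_prev, h = [Fraction(1)], [Fraction(0), Fraction(2)]
    if n == 0:
        return h_prev
    for k in range(1, n):
        # H_{k+1}(x) = 2x H_k(x) - 2k H_{k-1}(x)
        shifted = [Fraction(0)] + [2 * c for c in h]
        padded = h_prev + [Fraction(0)] * (len(shifted) - len(h_prev))
        h_prev, h = h, [s - 2 * k * p for s, p in zip(shifted, padded)]
    return h


def gaussian_moment(k):
    """Integral of x^k e^{-x^2} over the real line, divided by sqrt(pi)."""
    if k % 2:
        return Fraction(0)
    double_factorial = 1
    for j in range(k - 1, 0, -2):
        double_factorial *= j
    return Fraction(double_factorial, 2 ** (k // 2))


def weighted_inner_product(n, m):
    """Integral of H_n H_m e^{-x^2} over the real line, divided by sqrt(pi)."""
    a, b = hermite(n), hermite(m)
    return sum(a[i] * b[j] * gaussian_moment(i + j)
               for i in range(len(a)) for j in range(len(b)))


for n in range(5):
    print(n, [int(c) for c in hermite(n)], weighted_inner_product(n, n))
```

Dividing `weighted_inner_product(n, m)` by $\sqrt{2^n n! \, 2^m m!}$ gives $(\Omega_n, \Omega_m)$, so the diagonal values $2^n n!$ and the vanishing off-diagonal values correspond to orthonormality.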


Proof. Denote by $[A, B] = AB - BA$ the commutator of two operators $A$ and $B$. We check first by induction the formula

$$ [A, (A^*)^n] = n \cdot (A^*)^{n-1} . $$

For $n = 1$, this means $[A, A^*] = 1$. The induction step is

$$ [A, (A^*)^n] = [A, (A^*)^{n-1}] A^* + (A^*)^{n-1} [A, A^*] = (n-1)(A^*)^{n-2} A^* + (A^*)^{n-1} = n (A^*)^{n-1} . $$

a) Also

$$ ((A^*)^n \Omega_0, (A^*)^m \Omega_0) = n! \, \delta_{mn} $$

can be proven by induction. For $n = 0$ it follows from the fact that $\Omega_0$ is normalized. The induction step uses $[A, (A^*)^n] = n \cdot (A^*)^{n-1}$ and $A \Omega_0 = 0$:

$$ ((A^*)^n \Omega_0, (A^*)^m \Omega_0) = (A (A^*)^n \Omega_0, (A^*)^{m-1} \Omega_0) = ([A, (A^*)^n] \Omega_0, (A^*)^{m-1} \Omega_0) = n ((A^*)^{n-1} \Omega_0, (A^*)^{m-1} \Omega_0) . $$

If $n < m$, then we get from this $0$ after $n$ steps, while in the case $n = m$, we obtain $((A^*)^n \Omega_0, (A^*)^n \Omega_0) = n \cdot ((A^*)^{n-1} \Omega_0, (A^*)^{n-1} \Omega_0)$, which is by induction $n (n-1)! \, \delta_{n-1,n-1} = n!$.

b) $A^* \Omega_n = \sqrt{n+1} \cdot \Omega_{n+1}$ is the definition of $\Omega_{n+1}$. Furthermore,

$$ A \Omega_n = \frac{1}{\sqrt{n!}} A (A^*)^n \Omega_0 = \frac{1}{\sqrt{n!}} \, n (A^*)^{n-1} \Omega_0 = \sqrt{n} \, \Omega_{n-1} . $$

c) This follows from b) and the definition $\Omega_n = \frac{1}{\sqrt{n}} A^* \Omega_{n-1}$.


d) Part a) shows that $\{\Omega_n\}_{n=0}^\infty$ is an orthonormal set in $L^2(\mathbb{R})$. In order to show that they span $L^2(\mathbb{R})$, we have to verify that they span the dense set

$$ S = \{ f \in C^\infty(\mathbb{R}) \mid x^m f^{(n)}(x) \to 0, \ |x| \to \infty, \ \forall m, n \in \mathbb{N} \} $$

called the Schwartz space. The reason is that by the Hahn-Banach theorem, a function $f$ must be zero in $L^2(\mathbb{R})$ if it is orthogonal to a dense set. So, let us assume $(f, \Omega_n) = 0$ for all $n$, which means $0 = \sqrt{n!} \, (f, \Omega_n) = (f, (A^*)^n \Omega_0)$. Because $A^* + A = \sqrt{2} x$ and $(A^* + A)^n \Omega_0$ is a linear combination of $\Omega_0, \dots, \Omega_n$, we get

$$ 0 = (f, (A^* + A)^n \Omega_0) = 2^{n/2} (f, x^n \Omega_0) $$

and so

$$ (f \Omega_0)^\wedge(k) = \int f(x) \Omega_0(x) e^{ikx} \, dx = (f, \Omega_0 e^{ikx}) = \Big( f, \sum_{n \ge 0} \frac{(ikx)^n}{n!} \Omega_0 \Big) = \sum_{n \ge 0} \frac{(ik)^n}{n!} (f, x^n \Omega_0) = 0 $$

and so $f \Omega_0 = 0$. Since $\Omega_0(x)$ is positive for all $x$, we must have $f = 0$. This finishes the proof that we have a complete basis. □

Remark. This gives a complete solution to the quantum mechanical harmonic oscillator. With the eigenvalues $\{\lambda_n = n\}_{n=0}^\infty$ of $H = A^* A$ and the complete set of eigenvectors $\Omega_n$, one can solve the Schrödinger equation

$$ i \hbar \frac{d}{dt} u = H u $$

by writing the function $u(x) = \sum_{n=0}^\infty u_n \Omega_n(x)$ as a sum of eigenfunctions, where $u_n = (u, \Omega_n)$. The solution of the Schrödinger equation is

$$ u(t, x) = \sum_{n=0}^\infty u_n e^{-i \lambda_n t/\hbar} \Omega_n(x) . $$

Remark. The formalism of particle creation and annihilation operators can be extended to some potentials of the form $U(x) = q^2(x) - q'(x)$. The operator $H = -D^2/2 + U/2$ can then be written as $H = A^* A$, where

$$ A^* = \frac{1}{\sqrt{2}} \Big( q(x) - \frac{d}{dx} \Big), \qquad A = \frac{1}{\sqrt{2}} \Big( q(x) + \frac{d}{dx} \Big) . $$

The oscillator is the special case $q(x) = x$. See [12]. The Bäcklund transformation $H = A^* A \mapsto \tilde{H} = A A^*$ is in the case of the harmonic oscillator the map $H \mapsto H + 1$. It has the effect that it replaces $U = q^2 - q'$ with $\tilde{U} = q^2 + q' = U - 2 \partial_x^2 \log \Omega_0$, where $\Omega_0$ is the eigenfunction belonging to the lowest eigenvalue. The new operator $\tilde{H}$ has the same spectrum as $H$ except that the lowest eigenvalue is removed. This procedure can be reversed to create "soliton potentials" out of the vacuum. It is also natural to use the language of super-symmetry as introduced by Witten: take two copies $H_f \oplus H_b$ of the Hilbert space, where "f" stands for Fermion and "b" for Boson. With

$$ Q = \begin{pmatrix} 0 & A^* \\ A & 0 \end{pmatrix}, \qquad P = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix} , $$

one can write $H \oplus \tilde{H} = Q^2$, $P^2 = 1$, $QP + PQ = 0$ and one says $(H, P, Q)$ has super-symmetry. The operator $Q$ is also called a Dirac operator. A super-symmetric system has the property that nonzero eigenvalues have the same number of bosonic and fermionic eigenstates. This implies that $\tilde{H}$ has the same spectrum as $H$ except that the lowest eigenvalue can disappear.


Remark. In quantum field theory, there exists a process called canonical 
quantization, where a quantum mechanical system is extended to a quan- 
tum field. Particle annihilation and creation operators play an important 
role. 


4.14 Feynman-Kac for the oscillator 


We want to treat perturbations $L = L_0 + V$ of the harmonic oscillator $L_0$ with a similar Feynman-Kac formula. The calculation of the integral kernel $p_t(x,y)$ of $e^{-tL_0}$ satisfying

$$ (e^{-tL_0} f)(x) = \int_{\mathbb{R}} p_t(x,y) f(y) \, dy $$

is slightly more involved than in the case of the free Laplacian. Let $\Omega_0$ be the ground state of $L_0$ as in the last section.


Lemma 4.14.1. Given $f_0, f_1, \dots, f_n \in L^\infty(\mathbb{R})$ and $-\infty < s_0 < s_1 < \cdots < s_n < \infty$. Then

$$ (\Omega_0, f_0 e^{-t_1 L_0} f_1 \cdots e^{-t_n L_0} f_n \Omega_0) = \int f_0(Q_{s_0}) \cdots f_n(Q_{s_n}) \, dQ , $$

where $t_0 = s_0$, $t_i = s_i - s_{i-1}$, $i \ge 1$.


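Proof. The Trotter product formula for $L_0 = H_0 + U$ gives

$$ (\Omega_0, f_0 e^{-t_1 L_0} f_1 \cdots e^{-t_n L_0} f_n \Omega_0) = \lim_{m = (m_1, \dots, m_n), \, m_i \to \infty} (\Omega_0, f_0 (e^{-t_1 H_0/m_1} e^{-t_1 U/m_1})^{m_1} f_1 \cdots (e^{-t_n H_0/m_n} e^{-t_n U/m_n})^{m_n} f_n \Omega_0) $$

$$ = \lim_{m} \int f_0(x_0) \cdots f_n(x_n) \, dG_m(x, y) $$

and $G_m$ is a measure. Since $e^{-tH_0}$ has a Gaussian kernel and $e^{-tU}$ is a multiple of a Gaussian density and integrals are Gaussian, the measure $dG_m$ is Gaussian, converging to a Gaussian measure $dG$. Since $L_0(x \Omega_0) = x \Omega_0$ and $(x \Omega_0, x \Omega_0) = 1/2$ we have

$$ \int x_i x_j \, dG = (x \Omega_0, e^{-(s_j - s_i) L_0} x \Omega_0) = \frac{1}{2} e^{-(s_j - s_i)} $$

which shows that $dG$ is the joint probability distribution of $Q_{s_0}, \dots, Q_{s_n}$. The claim follows. □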


Theorem 4.14.2 (Mehler formula). The kernel $p_t(x,y)$ of $e^{-tL_0}$ is given by the Mehler formula

$$ p_t(x,y) = \frac{1}{\sqrt{\pi \sigma^2}} \exp\Big( -\frac{(x^2 + y^2)(1 + e^{-2t}) - 4xy \, e^{-t}}{2 \sigma^2} \Big) $$

with $\sigma^2 = (1 - e^{-2t})$.

Proof. We have

$$ (f, e^{-tL_0} g) = \int f(x) \Omega_0^{-1}(x) \, g(y) \Omega_0^{-1}(y) \, dG(x,y) = \int f(x) p_t(x,y) g(y) \, dx \, dy $$

with the Gaussian measure $dG$ having covariance

$$ A = \frac{1}{2} \begin{pmatrix} 1 & e^{-t} \\ e^{-t} & 1 \end{pmatrix} . $$

We get Mehler's formula by inverting this matrix and using that the density is

$$ (2\pi)^{-1} \det(A)^{-1/2} e^{-((x,y), A^{-1}(x,y))/2} . $$

□
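A numerical sanity check of the Mehler kernel, as a minimal sketch that is not part of the original notes: it assumes the convention $L_0 = A^* A$ of the last section, so that $L_0 \Omega_0 = 0$ and $L_0 (x \Omega_0) = x \Omega_0$, and it applies the kernel with a midpoint rule on a truncated interval. The function names, the cutoff `y_max` and the step count are illustrative choices.

```python
import math


def mehler_kernel(x, y, t):
    """Mehler kernel p_t(x, y) with sigma^2 = 1 - exp(-2t)."""
    sigma2 = 1.0 - math.exp(-2.0 * t)
    quad = (x * x + y * y) * (1.0 + math.exp(-2.0 * t)) - 4.0 * x * y * math.exp(-t)
    return math.exp(-quad / (2.0 * sigma2)) / math.sqrt(math.pi * sigma2)


def ground_state(x):
    """Omega_0(x) = pi^(-1/4) exp(-x^2/2)."""
    return math.pi ** -0.25 * math.exp(-x * x / 2.0)


def first_excited(x):
    """x * Omega_0(x), an eigenfunction of L_0 with eigenvalue 1."""
    return x * ground_state(x)


def apply_semigroup(u, x, t, y_max=12.0, steps=4000):
    """Midpoint-rule approximation of the integral of p_t(x, y) u(y) dy."""
    h = 2.0 * y_max / steps
    total = 0.0
    for k in range(steps):
        y = -y_max + (k + 0.5) * h
        total += mehler_kernel(x, y, t) * u(y)
    return total * h


x, t = 0.7, 0.5
print(apply_semigroup(ground_state, x, t), ground_state(x))
print(apply_semigroup(first_excited, x, t), math.exp(-t) * first_excited(x))
```

The first pair of numbers agrees because $e^{-tL_0} \Omega_0 = \Omega_0$, the second because $e^{-tL_0} (x \Omega_0) = e^{-t} x \Omega_0$.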


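Definition. Let $dQ$ be the Wiener measure on $C(\mathbb{R})$ belonging to the oscillator process $Q_t$.


Theorem 4.14.3 (Feynman-Kac for oscillator process). Given $L = L_0 + V$ with $V \in C_0^\infty(\mathbb{R})$, then

$$ (f \Omega_0, e^{-tL} g \Omega_0) = \int \overline{f(Q_0)} \, g(Q_t) \, e^{-\int_0^t V(Q_s) \, ds} \, dQ $$

for all $f, g \in L^2(\mathbb{R}, \Omega_0^2 \, dx)$.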


Proof. By the Trotter product formula

$$ (f \Omega_0, e^{-tL} g \Omega_0) = \lim_{n \to \infty} (f \Omega_0, (e^{-tL_0/n} e^{-tV/n})^n g \Omega_0) $$

so that

$$ (f \Omega_0, e^{-tL} g \Omega_0) = \lim_{n \to \infty} \int \overline{f(Q_0)} \, g(Q_t) \exp\Big( -\frac{t}{n} \sum_{j=0}^{n-1} V(Q_{tj/n}) \Big) \, dQ \qquad (4.5) $$

and since $Q_s$ is continuous, we have almost everywhere

$$ \frac{t}{n} \sum_{j=0}^{n-1} V(Q_{tj/n}) \to \int_0^t V(Q_s) \, ds . $$

The integrand on the right hand side of (4.5) is dominated by

$$ |f(Q_0)| \, |g(Q_t)| \, e^{t \|V\|_\infty} $$

which is in $L^1(dQ)$ since

$$ \int |f(Q_0)| \, |g(Q_t)| \, dQ = (\Omega_0 |f|, e^{-tL_0} \Omega_0 |g|) < \infty . $$

The dominated convergence theorem gives the claim. □


4.15 Neighborhood of Brownian motion


The Feynman-Kac formula can be used to understand the Dirichlet Laplacian of a domain $D \subset \mathbb{R}^d$. For more details, see [93].


Example. Let $D$ be an open set in $\mathbb{R}^d$ such that the Lebesgue measure $|D|$ is finite and the Lebesgue measure of the boundary $|\partial D|$ is zero. Denote by $H_D$ the Dirichlet Laplacian $-\Delta/2$. Denote by $k_D(E)$ the number of eigenvalues of $H_D$ below $E$. This function is also called the integrated density of states. Denote with $K_d$ the unit ball in $\mathbb{R}^d$ and with $|K_d| = \mathrm{Vol}(K_d) = \pi^{d/2} \Gamma(\frac{d}{2} + 1)^{-1}$ its volume. Weyl's formula describes the asymptotic behavior of $k_D(E)$ for large $E$:

$$ \lim_{E \to \infty} \frac{k_D(E)}{E^{d/2}} = \frac{|K_d| \cdot |D|}{2^{d/2} \pi^d} . $$

It shows that one can read off the volume of $D$ from the spectrum of the Laplacian.
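Weyl's formula can be tried out on a domain whose spectrum is known explicitly. The sketch below is an illustration that is not part of the original notes: it takes the square $D = (0,\pi)^2$, for which the Dirichlet eigenvalues of $-\Delta/2$ are $(m^2 + n^2)/2$ with $m, n \ge 1$ (eigenfunctions $\sin(mx)\sin(ny)$), and compares the counting function with the Weyl constant. The function names are hypothetical.

```python
import math


def count_eigenvalues_square(E):
    """Number of eigenvalues (m^2 + n^2)/2, m, n >= 1, of -Laplacian/2 on (0, pi)^2 below E."""
    bound = int(math.sqrt(2.0 * E)) + 1
    return sum(1
               for m in range(1, bound + 1)
               for n in range(1, bound + 1)
               if (m * m + n * n) / 2.0 < E)


def weyl_constant(d, volume):
    """Right hand side |K_d| |D| / (2^(d/2) pi^d) of Weyl's formula."""
    unit_ball = math.pi ** (d / 2.0) / math.gamma(d / 2.0 + 1.0)
    return unit_ball * volume / (2.0 ** (d / 2.0) * math.pi ** d)


limit = weyl_constant(2, math.pi ** 2)  # equals pi/2 for the square (0, pi)^2
for E in (50.0, 500.0, 5000.0):
    print(E, count_eigenvalues_square(E) / E, limit)
```

The ratio $k_D(E)/E$ approaches the limit slowly from below; the deficit comes from lattice points near the boundary of the quarter disc.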


Example. Put $n$ ice balls $K_{j,n}$, $1 \le j \le n$, of radius $r_n$ into a glass of water so that $n \cdot r_n = \alpha$. In order to know how well this ice cools the water, it is good to know the lowest eigenvalue $E_1$ of the Dirichlet Laplacian $H_D$, since the motion of the temperature distribution $u$ by the heat equation $\dot{u} = -H_D u$ is dominated by $e^{-tE_1}$. This motivates to compute the lowest eigenvalue of the domain $D \setminus \bigcup_{j=1}^n K_{j,n}$. This can be done exactly in the limit $n \to \infty$ and when the ice $K_{j,n}$ is randomly distributed in the glass. Mathematically, this is described as follows:

Let $D$ be an open bounded domain in $\mathbb{R}^d$. Given a sequence $x = (x_1, x_2, \dots)$ which is an element in $D^{\mathbb{N}}$ and a sequence of radii $r_1, r_2, \dots$, define

$$ D_n = D \setminus \bigcup_{i=1}^{n} \{ |x - x_i| \le r_n \} . $$

This is the domain $D$ with the $n$ balls $K_{j,n}$ with centers $x_1, \dots, x_n$ and radius $r_n$ removed. Let $H(x,n)$ be the Dirichlet Laplacian on $D_n$ and $E_k(x,n)$ the $k$-th eigenvalue of $H(x,n)$. These are random variables $E_k(n)$ in $x$, if $D^{\mathbb{N}}$ is equipped with the product Lebesgue measure. One can show that in the case $n r_n \to \alpha$

$$ E_k(n) \to E_k(0) + 2 \pi \alpha |D|^{-1} $$

in probability. Random impurities produce a constant shift in the spectrum. For the physical system with the crushed ice, where the crushing makes $n r_n \to \infty$, there is much better cooling than one might expect.


Definition. Let $W_\delta(t)$ be the set

$$ \{ x \in \mathbb{R}^d \mid |x - B_s(\omega)| \le \delta \text{ for some } s \in [0,t] \} . $$

It is of course dependent on $\omega$ and just a $\delta$-neighborhood of the Brownian path $B_{[0,t]}(\omega)$. This set is called Wiener sausage and one is interested in the expected volume $|W_\delta(t)|$ of this set as $\delta \to 0$. We will look at this problem a bit more closely in the rest of this section.

Figure. A sample of Wiener sausage in the plane $d = 2$. A finite path of Brownian motion with its neighborhood $W_\delta$.


Let us first prove a lemma, which relates the Dirichlet Laplacian $H_D = -\Delta/2$ on $D$ with Brownian motion.


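Lemma 4.15.1. Let $D$ be a bounded domain in $\mathbb{R}^d$ containing $0$ and $p_D(x,y,t)$ the integral kernel of $e^{-tH}$, where $H$ is the Dirichlet Laplacian on $D$. Then

$$ P[B_s \in D^c \text{ for some } 0 \le s \le t] = 1 - \int p_D(0,x,t) \, dx . $$

Proof. (i) It is known that the Dirichlet Laplacian can be approximated in the strong resolvent sense by operators $H_0 + \lambda V$, where $V = 1_{D^c}$ is the characteristic function of the exterior $D^c$ of $D$. This means that

$$ (H_0 + \lambda \cdot V - z)^{-1} u \to (H_D - z)^{-1} u, \quad \lambda \to \infty $$

for $z$ outside $[0,\infty)$ and all $u \in C_c^\infty(\mathbb{R}^d)$.


(ii) Since Brownian paths are continuous, we have $\int_0^t V(B_s) \, ds > 0$ if and only if $B_s \in D^c$ for some $s \in [0,t]$. We get therefore, for $\lambda \to \infty$,

$$ e^{-\lambda \int_0^t V(B_s) \, ds} \to 1_{\{B_s \in D, \, 0 \le s \le t\}} $$

pointwise almost everywhere.


Let $u_n$ be a sequence in $C_c^\infty$ converging pointwise to $1$. We get with the dominated convergence theorem, using (i) and (ii) and Feynman-Kac

$$ P[B_s \in D; \, 0 \le s \le t] = \lim_{n \to \infty} E[u_n(B_t); \, B_s \in D, \, 0 \le s \le t] $$

$$ = \lim_{n \to \infty} \lim_{\lambda \to \infty} E[e^{-\lambda \int_0^t V(B_s) \, ds} u_n(B_t)] $$

$$ = \lim_{n \to \infty} \lim_{\lambda \to \infty} e^{-t(H_0 + \lambda V)} u_n(0) $$

$$ = \lim_{n \to \infty} e^{-tH_D} u_n(0) $$

$$ = \lim_{n \to \infty} \int p_D(0,x,t) u_n(x) \, dx = \int p_D(0,x,t) \, dx , $$

which is the probability of the complementary event and so equivalent to the claim. □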


Theorem 4.15.2 (Spitzer). In three dimensions $d = 3$,

$$ E[|W_\delta(t)|] = 2 \pi \delta t + 4 \delta^2 \sqrt{2 \pi t} + \frac{4\pi}{3} \delta^3 . $$


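Proof. Using Brownian scaling,

$$ E[|W_{\lambda \delta}(\lambda^2 t)|] = E[|\{ |x - B_s| \le \lambda \delta, \ 0 \le s \le \lambda^2 t \}|] $$

$$ = E[|\{ |\tfrac{x}{\lambda} - \tfrac{B_s}{\lambda}| \le \delta, \ 0 \le \tilde{s} = s/\lambda^2 \le t \}|] $$

$$ = E[|\{ |\tfrac{x}{\lambda} - B_{\tilde{s}}| \le \delta, \ 0 \le \tilde{s} \le t \}|] $$

$$ = \lambda^3 \cdot E[|W_\delta(t)|] , $$

so that one can assume without loss of generality that $\delta = 1$: knowing $E[|W_1(t)|]$, we get the general case with the formula $E[|W_\delta(t)|] = \delta^3 \cdot E[|W_1(\delta^{-2} t)|]$.


Let $K$ be the closed unit ball in $\mathbb{R}^3$. Define the hitting probability

$$ f(x,t) = P[x + B_s \in K \text{ for some } 0 \le s \le t] . $$

We have

$$ E[|W_1(t)|] = \int_{\mathbb{R}^3} f(x,t) \, dx . $$

Proof.

$$ E[|W_1(t)|] = \int \int 1_{\{x \in W_1(t)\}} \, dx \, dB = \int \int 1_{\{B_s - x \in K; \ 0 \le s \le t\}} \, dx \, dB $$

$$ = \int \int 1_{\{B_s - x \in K; \ 0 \le s \le t\}} \, dB \, dx = \int f(x,t) \, dx . $$

The hitting probability is radially symmetric and can be computed explicitly in terms of $r = |x|$: for $|x| \ge 1$, one has

$$ f(x,t) = \frac{2}{r \sqrt{2 \pi t}} \int_0^\infty e^{-\frac{(|x| + z - 1)^2}{2t}} \, dz . $$

Proof. The kernel of $e^{-tH}$ satisfies the heat equation

$$ \partial_t p(x,0,t) = (\Delta/2) p(x,0,t) $$

inside $D$. From the previous lemma follows that $\dot{f} = (\Delta/2) f$, so that the function $g(r,t) = r f(x,t)$ satisfies $\dot{g} = \frac{1}{2} \frac{\partial^2}{\partial r^2} g(r,t)$ with boundary conditions $g(r,0) = 0$, $g(1,t) = 1$. We compute

$$ \int_{|x| \ge 1} f(x,t) \, dx = 2 \pi t + 4 \sqrt{2 \pi t} $$

and $\int_{|x| \le 1} f(x,t) \, dx = 4\pi/3$ so that

$$ E[|W_1(t)|] = 2 \pi t + 4 \sqrt{2 \pi t} + 4\pi/3 . $$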


Corollary 4.15.3. In three dimensions, one has:

$$ \lim_{\delta \to 0} \frac{1}{\delta} E[|W_\delta(t)|] = 2 \pi t $$

and

$$ \lim_{t \to \infty} \frac{1}{t} E[|W_\delta(t)|] = 2 \pi \delta . $$

Proof. The proof follows immediately from Spitzer's theorem (4.15.2). □
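The two integrals in the proof of Spitzer's theorem can be verified numerically. The following is a minimal sketch, not part of the original notes: it uses that the $z$-integral in the hitting probability evaluates to the complementary error function, $f(x,t) = \mathrm{erfc}((r-1)/\sqrt{2t})/r$ for $r = |x| \ge 1$ (a closed form derived here by substitution), and integrates in spherical coordinates with a midpoint rule. The function names and the cutoff radius are illustrative assumptions.

```python
import math


def hitting_probability(r, t):
    """f(x, t) for |x| = r >= 1 in R^3, written with the complementary error function."""
    return math.erfc((r - 1.0) / math.sqrt(2.0 * t)) / r


def outer_integral(t, steps=20000):
    """Midpoint rule for the integral of f(x, t) over |x| >= 1, truncated at a large radius."""
    r_max = 1.0 + 12.0 * math.sqrt(t)  # erfc is negligible beyond this radius
    h = (r_max - 1.0) / steps
    total = 0.0
    for k in range(steps):
        r = 1.0 + (k + 0.5) * h
        total += r * r * hitting_probability(r, t)
    return 4.0 * math.pi * total * h


def spitzer(t, delta=1.0):
    """Expected volume of the Wiener sausage from Theorem 4.15.2."""
    return (2.0 * math.pi * delta * t
            + 4.0 * delta ** 2 * math.sqrt(2.0 * math.pi * t)
            + 4.0 * math.pi * delta ** 3 / 3.0)


t = 1.0
inner = 4.0 * math.pi / 3.0  # f = 1 inside the unit ball
print(outer_integral(t) + inner, spitzer(t))
print(spitzer(2.0, 0.5), 0.5 ** 3 * spitzer(2.0 / 0.5 ** 2))
```

The second line of output illustrates the scaling relation $E[|W_\delta(t)|] = \delta^3 E[|W_1(\delta^{-2} t)|]$ used at the start of the proof.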


Remark. If Brownian motion were one-dimensional, then $\delta^{-2} E[|W_\delta(t)|]$ would stay bounded as $\delta \to 0$. The corollary shows that the Wiener sausage is quite "fat". Brownian motion is rather "two-dimensional".


Remark. Kesten, Spitzer and Whitman have obtained stronger results. It is even true that $\lim_{\delta \to 0} |W_\delta(t)|/\delta = 2 \pi t$ and $\lim_{t \to \infty} |W_\delta(t)|/t = 2 \pi \delta$ for almost all paths.


4.16 The Ito integral for Brownian motion


We start now to develop stochastic integration first for Brownian motion and then more generally for continuous martingales. Let us start with a motivation. We know by theorem (4.2.5) that almost all paths of Brownian motion are not differentiable. The usual Lebesgue-Stieltjes integral

$$ \int_0^t f(B_s) \dot{B}_s \, ds $$

can therefore not be defined. We are first going to see how a stochastic integral can still be constructed. Actually, we were already dealing with a special case of stochastic integrals, namely with Wiener integrals $\int f(B) \, dB$, where $f$ is a function on $C([0,\infty), \mathbb{R}^d)$ which can contain for example $\int_0^t V(B_s) \, ds$ as in the Feynman-Kac formula. But the result of this integral was a number while the stochastic integral, which we are going to define, will be a random variable.


Definition. Let $B_t$ be the one-dimensional Brownian motion process and let $f$ be a function $f : \mathbb{R} \to \mathbb{R}$. Define for $n \in \mathbb{N}$ the random variable

$$ J_n(f) = \sum_{m=1}^{2^n} f(B_{(m-1) 2^{-n}}) (B_{m 2^{-n}} - B_{(m-1) 2^{-n}}) =: \sum_{m=1}^{2^n} J_{n,m}(f) . $$

We will use later for $J_{n,m}(f)$ also the notation $f(B_{t_{m-1}}) \delta_n B_{t_m}$, where $\delta_n B_t = B_t - B_{t - 2^{-n}}$.


Remark. We have earlier defined the discrete stochastic integral for a previsible process $C$ and a martingale $X$

$$ \Big( \int C \, dX \Big)_n = \sum_{m=1}^{n} C_m (X_m - X_{m-1}) . $$

If we want to take for $C$ a function of $X$, then we have to take $C_m = f(X_{m-1})$. This is the reason why we have to take the differentials $\delta_n B_{t_m}$ to "stick out into the future".


The stochastic integral is a limit of discrete stochastic integrals:


Lemma 4.16.1. If $f \in C^1(\mathbb{R})$ such that $f, f'$ are bounded on $\mathbb{R}$, then $J_n(f)$ converges in $L^2$ to a random variable

$$ \int_0^1 f(B_s) \, dB = \lim_{n \to \infty} J_n $$

satisfying

$$ \Big\| \int_0^1 f(B_s) \, dB \Big\|_2^2 = E\Big[ \int_0^1 f(B_s)^2 \, ds \Big] . $$

Proof. (i) For $i \ne j$ we have $E[J_{n,i}(f) J_{n,j}(f)] = 0$.

Proof. For $j > i$, there is a factor $B_{j 2^{-n}} - B_{(j-1) 2^{-n}}$ of $J_{n,i}(f) J_{n,j}(f)$ which is independent of the rest of $J_{n,i}(f) J_{n,j}(f)$ and the claim follows from $E[B_{j 2^{-n}} - B_{(j-1) 2^{-n}}] = 0$.


(ii) $E[J_{n,m}(f)^2] = E[f(B_{(m-1) 2^{-n}})^2] \, 2^{-n}$.

Proof. $f(B_{(m-1) 2^{-n}})$ is independent of $(B_{m 2^{-n}} - B_{(m-1) 2^{-n}})^2$, which has expectation $2^{-n}$.


(iii) From (i) and (ii) follows

$$ \|J_n(f)\|_2^2 = \sum_{m=1}^{2^n} E[f(B_{(m-1) 2^{-n}})^2] \, 2^{-n} . $$

(iv) The claim: $J_n$ converges in $L^2$.

Since $f \in C^1$, there exists $C = \|f'\|_\infty^2$ and this gives $|f(x) - f(y)|^2 \le C \cdot |x - y|^2$. We get

$$ \|J_{n+1}(f) - J_n(f)\|_2^2 = \sum_{m=0}^{2^n - 1} E[(f(B_{(2m+1) 2^{-(n+1)}}) - f(B_{(2m) 2^{-(n+1)}}))^2] \, 2^{-(n+1)} $$

$$ \le C \sum_{m=0}^{2^n - 1} E[(B_{(2m+1) 2^{-(n+1)}} - B_{(2m) 2^{-(n+1)}})^2] \, 2^{-(n+1)} = C \cdot 2^{-n-2} , $$

where the last equality followed from the fact that $E[(B_{(2m+1) 2^{-(n+1)}} - B_{(2m) 2^{-(n+1)}})^2] = 2^{-(n+1)}$ since $B$ is Gaussian. Because the numbers $\sqrt{C \cdot 2^{-n-2}}$ are summable, we see that $J_n$ is a Cauchy sequence in $L^2$ and has therefore a limit.


(v) The claim $\| \int_0^1 f(B_s) \, dB \|_2^2 = E[\int_0^1 f(B_s)^2 \, ds]$.

Proof. Since $\sum_m f(B_{(m-1) 2^{-n}})^2 \, 2^{-n}$ converges pointwise to $\int_0^1 f(B_s)^2 \, ds$ (which exists because $f$ and $B_s$ are continuous), and is dominated by $\|f\|_\infty^2$, the claim follows since $J_n$ converges in $L^2$. □


We can extend the integral to functions $f$ which are locally $L^2$ and bounded near $0$. We write $L^2_{loc}(\mathbb{R})$ for functions $f$ which are in $L^2(I)$ when restricted to any finite interval $I$ on the real line.


Corollary 4.16.2. $\int_0^1 f(B_s) \, dB$ exists as an $L^2$ random variable for $f \in L^2_{loc}(\mathbb{R}) \cap L^\infty(-\epsilon, \epsilon)$ and any $\epsilon > 0$.

Proof. (i) If $f \in L^2_{loc}(\mathbb{R}) \cap L^\infty(-\epsilon, \epsilon)$ for some $\epsilon > 0$, then

$$ E\Big[ \int_0^1 f(B_s)^2 \, ds \Big] = \int_0^1 \int_{\mathbb{R}} \frac{f(x)^2}{\sqrt{2 \pi s}} e^{-x^2/2s} \, dx \, ds < \infty . $$

(ii) If $f \in L^2_{loc}(\mathbb{R}) \cap L^\infty(-\epsilon, \epsilon)$, then for almost every $B(\omega)$, the limit

$$ \lim_{a \to \infty} \int_0^1 1_{[-a,a]}(B_s) f(B_s)^2 \, ds $$

exists pointwise and is finite.

Proof. $B_s$ is continuous for almost all $\omega$ so that $1_{[-a,a]}(B_s) f(B_s)$ is independent of $a$ for large $a$. The integral $E[\int_0^1 1_{[-a,a]}(B_s) f(B_s)^2 \, ds]$ is bounded by $E[\int_0^1 f(B_s)^2 \, ds] < \infty$ by (i).


(iii) The claim.

Proof. Assume $f \in L^2_{loc}(\mathbb{R}) \cap L^\infty(-\epsilon, \epsilon)$. Take $f_n \in C^1(\mathbb{R})$ with $1_{[-a,a]} f_n \to 1_{[-a,a]} f$ in $L^2(\mathbb{R})$.

By the dominated convergence theorem, we have

$$ \int 1_{[-a,a]} f_n(B_s) \, dB \to \int 1_{[-a,a]} f(B_s) \, dB $$

in $L^2$. Since by (ii), the $L^2$ bound is independent of $a$, we can also pass to the limit $a \to \infty$. □


Definition. This integral is called an Ito integral. Having the one-dimensional integral allows also to set up the integral in higher dimensions: with Brownian motion in $\mathbb{R}^d$ and $f \in L^2_{loc}(\mathbb{R}^d)$ define the integral $\int_0^1 f(B_s) \, dB_s$ component-wise.


Lemma 4.16.3. For $n \to \infty$,

$$ \sum_{j=1}^{2^n} J_{n,j}(1)^2 = \sum_{j=1}^{2^n} (B_{j/2^n} - B_{(j-1)/2^n})^2 \to 1 . $$

Proof. By definition of Brownian motion, we know that for fixed $n$, $J_{n,j}$ are $N(0, 2^{-n})$-distributed random variables and so

$$ E\Big[ \sum_{j=1}^{2^n} J_{n,j}(1)^2 \Big] = 2^n \cdot \mathrm{Var}[B_{j/2^n} - B_{(j-1)/2^n}] = 2^n 2^{-n} = 1 . $$

Now, $X_j = 2^{n/2} J_{n,j}$ are IID $N(0,1)$-distributed random variables so that by the law of large numbers

$$ \frac{1}{2^n} \sum_{j=1}^{2^n} X_j^2 \to 1 $$

for $n \to \infty$. □




The formal rules of integration do not hold for this integral. We have for example in one dimension:

$$ \int_0^1 B_s \, dB = \frac{1}{2} (B_1^2 - 1) \ne \frac{1}{2} (B_1^2 - B_0^2) . $$

Proof. Define, with $f(x) = x$,

$$ J_n^- = \sum_{m=1}^{2^n} f(B_{(m-1) 2^{-n}}) (B_{m 2^{-n}} - B_{(m-1) 2^{-n}}) , $$

$$ J_n^+ = \sum_{m=1}^{2^n} f(B_{m 2^{-n}}) (B_{m 2^{-n}} - B_{(m-1) 2^{-n}}) . $$

The above lemma implies that $J_n^+ - J_n^- \to 1$ almost everywhere for $n \to \infty$ and we check also $J_n^+ + J_n^- = B_1^2$. Both of these identities come from cancellations in the sum and imply together the claim. □
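The two identities in this proof are easy to observe on a simulated path. The sketch below is an illustration, not part of the original notes: it samples Brownian motion at the dyadic times $m 2^{-n}$ with Python's standard random number generator (the seed and the value $n = 14$ are arbitrary choices) and forms the left and right sums $J_n^-$ and $J_n^+$ for $f(x) = x$.

```python
import random


def brownian_path(n, seed=0):
    """Sample B at the dyadic times m 2^{-n}, m = 0, ..., 2^n, on [0, 1]."""
    rng = random.Random(seed)
    steps = 2 ** n
    sigma = (1.0 / steps) ** 0.5  # standard deviation of one increment
    path = [0.0]
    for _ in range(steps):
        path.append(path[-1] + rng.gauss(0.0, sigma))
    return path


def left_and_right_sums(path):
    """Return (J_minus, J_plus) for f(x) = x along the sampled path."""
    j_minus = sum(path[m - 1] * (path[m] - path[m - 1]) for m in range(1, len(path)))
    j_plus = sum(path[m] * (path[m] - path[m - 1]) for m in range(1, len(path)))
    return j_minus, j_plus


path = brownian_path(14)
j_minus, j_plus = left_and_right_sums(path)
b1 = path[-1]
quadratic_variation = j_plus - j_minus  # sum of squared increments

print("J+ + J- =", j_plus + j_minus, " B_1^2 =", b1 ** 2)
print("J+ - J- =", quadratic_variation)
print("J-      =", j_minus, " (B_1^2 - 1)/2 =", (b1 ** 2 - 1.0) / 2.0)
```

The identity $J_n^+ + J_n^- = B_1^2$ holds exactly for every path (it is a telescoping sum), while $J_n^+ - J_n^-$ is only close to $1$, with fluctuations of order $2^{-n/2}$.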


We mention now some trivial properties of the stochastic integral. 


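Theorem 4.16.4 (Properties of the Ito integral). Here are some basic properties of the Ito integral:

(1) $\int_0^t f(B_s) + g(B_s) \, dB_s = \int_0^t f(B_s) \, dB_s + \int_0^t g(B_s) \, dB_s$.

(2) $\int_0^t \lambda \cdot f(B_s) \, dB_s = \lambda \cdot \int_0^t f(B_s) \, dB_s$.

(3) $t \mapsto \int_0^t f(B_s) \, dB_s$ is a continuous map from $\mathbb{R}^+$ to $L^2$.

(4) $E[\int_0^t f(B_s) \, dB_s] = 0$.

(5) $\int_0^t f(B_s) \, dB_s$ is $\mathcal{A}_t$ measurable.


Proof. (1) and (2) follow from the definition of the integral.
For (3) define $X_t = \int_0^t f(B_s) \, dB$. Since

$$ \|X_t - X_{t+\epsilon}\|_2^2 = E\Big[ \int_t^{t+\epsilon} f(B_s)^2 \, ds \Big] = \int_t^{t+\epsilon} \int_{\mathbb{R}} \frac{f(x)^2}{\sqrt{2 \pi s}} e^{-x^2/2s} \, dx \, ds \to 0 $$

for $\epsilon \to 0$, the claim follows.
(4) and (5) can be seen by verifying it first for elementary functions $f$. □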


It will be useful to consider another generalization of the integral.
Definition. If $dW = dx \, dB$ is the Wiener measure on $\mathbb{R}^d \times C([0,\infty))$, define

$$ \int_0^t f(W_s) \, dW_s = \int_{\mathbb{R}^d} \int_0^t f(x + B_s) \, dB_s \, dx . $$




Definition. Assume $f$ is also time dependent so that it is a function on $\mathbb{R}^d \times \mathbb{R}$. As long as $E[\int_0^t |f(B_s, s)|^2 \, ds] < \infty$, we can also define the integral

$$ \int_0^t f(B_s, s) \, dB_s . $$


The following formula is useful for understanding and calculating stochastic integrals. It is the "fundamental theorem for stochastic integrals" and allows a "change of variables" in stochastic calculus, similarly as the fundamental theorem of calculus does for usual calculus.


Theorem 4.16.5 (Ito's formula). For a $C^2$ function $f(x)$ on $\mathbb{R}^d$

$$ f(B_t) - f(B_0) = \int_0^t \nabla f(B_s) \cdot dB_s + \frac{1}{2} \int_0^t \Delta f(B_s) \, ds . $$

If $B_s$ were an ordinary path in $\mathbb{R}^d$ with velocity vector $dB_s = \dot{B}_s \, ds$, then we would have

$$ f(B_t) - f(B_0) = \int_0^t \nabla f(B_s) \cdot \dot{B}_s \, ds $$

by the fundamental theorem of line integrals in calculus. It is a bit surprising that in the stochastic setup, a second derivative $\Delta f$ appears in a first order differential. One writes sometimes the formula also in the differential form

$$ df = \nabla f \, dB + \frac{1}{2} \Delta f \, dt . $$
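As a short worked instance of the formula (an illustration, with the choices $d = 1$, $f(x) = x^2/2$ and $B_0 = 0$ made here), the second derivative term produces exactly the correction seen earlier for $\int B \, dB$:

```latex
f(x) = \tfrac{1}{2} x^2, \qquad f'(x) = x, \qquad f''(x) = 1
\quad\Longrightarrow\quad
\tfrac{1}{2} B_t^2 - \tfrac{1}{2} B_0^2
  = \int_0^t B_s \, dB_s + \tfrac{1}{2} \int_0^t 1 \, ds ,
\qquad\text{so}\qquad
\int_0^t B_s \, dB_s = \tfrac{1}{2} \left( B_t^2 - t \right) .
```

For $t = 1$ this is the identity $\int_0^1 B_s \, dB = \frac{1}{2}(B_1^2 - 1)$ obtained from the left and right Riemann sums.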


Remark. We cite [11]: "Ito's formula is now the bread and butter of the "quant" department of several major financial institutions. Models like that of Black-Scholes constitute the basis on which a modern business makes decisions about how everything from stocks and bonds to pork belly futures should be priced. Ito's formula provides the link between various stochastic quantities and differential equations of which those quantities are the solution." For more information on the Black-Scholes model and the famous Black-Scholes formula, see [16].

It is not much more work to prove a more general formula for functions $f(x,t)$, which can be time-dependent too:


Theorem 4.16.6 (Generalized Ito formula). Given a function $f(x,t)$ on $\mathbb{R}^d \times [0,t]$ which is twice differentiable in $x$ and differentiable in $t$. Then

$$ f(B_t, t) - f(B_0, 0) = \int_0^t \nabla f(B_s, s) \cdot dB_s + \frac{1}{2} \int_0^t \Delta f(B_s, s) \, ds + \int_0^t \dot{f}(B_s, s) \, ds . $$

In differential notation, this means

$$ df = \nabla f \, dB + \Big( \frac{1}{2} \Delta f + \dot{f} \Big) \, dt . $$


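Proof. By a change of variables, we can assume $t = 1$. For each $n$, we discretize time

$$ \{ 0 = t_0 < t_1 < \cdots < t_{2^n} = 1 \}, \qquad t_k = k \cdot 2^{-n} , $$

and define $\delta_n B_{t_k} = B_{t_k} - B_{t_{k-1}}$. We write

$$ f(B_1, 1) - f(B_0, 0) = \sum_{k=1}^{2^n} (\nabla f)(B_{t_{k-1}}, t_{k-1}) \, \delta_n B_{t_k} $$

$$ + \sum_{k=1}^{2^n} \Big[ f(B_{t_k}, t_{k-1}) - f(B_{t_{k-1}}, t_{k-1}) - (\nabla f)(B_{t_{k-1}}, t_{k-1}) \, \delta_n B_{t_k} \Big] $$

$$ + \sum_{k=1}^{2^n} \Big[ f(B_{t_k}, t_k) - f(B_{t_k}, t_{k-1}) \Big] $$

$$ = I_n + II_n + III_n . $$

(i) By definition of the Ito integral, the first sum $I_n$ converges in $L^2$ to $\int_0^1 (\nabla f)(B_s, s) \, dB_s$.


(ii) If $p > 2$, we have $\sum_{k=1}^{2^n} |\delta_n B_{t_k}|^p \to 0$ for $n \to \infty$.

Proof. $\delta_n B_{t_k}$ is a $N(0, 2^{-n})$-distributed random variable so that

$$ E[|\delta_n B_{t_k}|^p] = (2\pi)^{-1/2} 2^{-(np)/2} \int_{-\infty}^{\infty} |x|^p e^{-x^2/2} \, dx = C \, 2^{-(np)/2} . $$

This means

$$ E\Big[ \sum_{k=1}^{2^n} |\delta_n B_{t_k}|^p \Big] = C \, 2^n 2^{-(np)/2} $$

which goes to zero for $n \to \infty$ and $p > 2$.


(iii) $\sum_{k=1}^{2^n} E[(B_{t_k} - B_{t_{k-1}})^4] \to 0$ follows from (ii). We have therefore for bounded $g$

$$ \sum_{k=1}^{2^n} E[g(B_{t_{k-1}}, t_{k-1})^2 ((B_{t_k} - B_{t_{k-1}})^2 - 2^{-n})^2] \le C \sum_{k=1}^{2^n} \mathrm{Var}[(B_{t_k} - B_{t_{k-1}})^2] \le C \sum_{k=1}^{2^n} E[(B_{t_k} - B_{t_{k-1}})^4] \to 0 . $$

(iv) Using a Taylor expansion

$$ f(x) - f(y) - \nabla f(y)(x - y) - \frac{1}{2} \sum_{i,j} \partial_{x_i x_j} f(y) (x - y)_i (x - y)_j = O(|x - y|^3) , $$

we get for $n \to \infty$

$$ II_n - \sum_{k=1}^{2^n} \frac{1}{2} \sum_{i,j} \partial_{x_i x_j} f(B_{t_{k-1}}, t_{k-1}) (\delta_n B_{t_k})_i (\delta_n B_{t_k})_j \to 0 $$

in $L^2$. Since

$$ \sum_{k=1}^{2^n} \frac{1}{2} \sum_{i,j} \partial_{x_i x_j} f(B_{t_{k-1}}, t_{k-1}) [(\delta_n B_{t_k})_i (\delta_n B_{t_k})_j - \delta_{ij} 2^{-n}] $$

goes to zero in $L^2$ (applying (iii) for $g = \partial_{x_i x_j} f$ and noting that $(\delta_n B_{t_k})_i$ and $(\delta_n B_{t_k})_j$ are independent for $i \ne j$), we have therefore

$$ II_n \to \frac{1}{2} \int_0^1 \Delta f(B_s, s) \, ds $$

in $L^2$.

(v) A Taylor expansion with respect to $t$

$$ f(x,t) - f(x,s) = \dot{f}(x,s)(t - s) + O((t - s)^2) $$

gives

$$ III_n \to \int_0^1 \dot{f}(B_s, s) \, ds $$

in $L^1$ because $s \mapsto \dot{f}(B_s, s)$ is continuous and $III_n$ is a Riemann sum approximation. □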


Example. Consider the function

$$ f(x,t) = e^{\alpha x - \alpha^2 t/2} . $$

Because this function satisfies the heat equation $\dot{f} + f''/2 = 0$, we get from Ito's formula

$$ f(B_t, t) - f(B_0, 0) = \alpha \int_0^t f(B_s, s) \, dB_s . $$

We see that for functions satisfying the heat equation $\dot{f} + f''/2 = 0$, Ito's formula reduces to the usual rule of calculus. If we make a power expansion in $\alpha$ of

$$ \int_0^t e^{\alpha B_s - \alpha^2 s/2} \, dB = \frac{1}{\alpha} e^{\alpha B_t - \alpha^2 t/2} - \frac{1}{\alpha} , $$

we get other formulas like

$$ \int_0^t B_s \, dB = \frac{1}{2} (B_t^2 - t) . $$

Wick ordering.

There is a notation used in quantum field theory developed by Gian-Carlo Wick at about the same time as Ito invented the integral. This Wick ordering is a map on polynomials $\sum_{i=0}^n a_i x^i$ which leaves the set of monic polynomials (polynomials of the form $x^n + a_{n-1} x^{n-1} + \cdots$) invariant.


Definition. Let

$$ \Omega_n(x) = \frac{H_n(x) \Omega_0(x)}{\sqrt{2^n n!}} $$

be the $n$-th eigenfunction of the quantum mechanical oscillator. Define

$$ :x^n: \ = \frac{1}{2^{n/2}} H_n\Big( \frac{x}{\sqrt{2}} \Big) $$

and extend the definition to all polynomials by linearity. With $x = \sqrt{2} y$, the polynomials $:x^n:$ are orthogonal with respect to the measure $\Omega_0^2 \, dy = \pi^{-1/2} e^{-y^2} \, dy$ because we have seen that the eigenfunctions $\Omega_n$ are orthonormal.


Example. Here are the first Wick powers:

$$ :x: \ = x $$

$$ :x^2: \ = x^2 - 1 $$

$$ :x^3: \ = x^3 - 3x $$

$$ :x^4: \ = x^4 - 6x^2 + 3 $$

$$ :x^5: \ = x^5 - 10x^3 + 15x . $$
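The list of Wick powers can be generated mechanically. The sketch below is an illustration, not part of the original notes: it assumes the three-term recurrence $:x^{k+1}: = x :x^k: - k :x^{k-1}:$, which is the standard recurrence for the polynomials $2^{-n/2} H_n(x/\sqrt{2})$, and stores a polynomial as a list of integer coefficients with the lowest degree first. The function name `wick_power` is hypothetical.

```python
def wick_power(n):
    """Coefficients (lowest degree first) of the Wick power :x^n:."""
    prev, cur = [1], [0, 1]  # :x^0: = 1 and :x^1: = x
    if n == 0:
        return prev
    for k in range(1, n):
        # :x^{k+1}: = x :x^k: - k :x^{k-1}:
        shifted = [0] + cur
        padded = prev + [0] * (len(shifted) - len(prev))
        prev, cur = cur, [s - k * p for s, p in zip(shifted, padded)]
    return cur


for n in range(6):
    print(n, wick_power(n))
```

Reading the coefficient lists from the right reproduces the table above, for example `[3, 0, -6, 0, 1]` for $:x^4: = x^4 - 6x^2 + 3$.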


Definition. The multiplication operator $Q : f \mapsto x f$ is called the position operator. By definition of the creation and annihilation operators one has

$$ Q = \frac{1}{\sqrt{2}} (A + A^*) . $$


The following formula indicates, why Wick ordering has its name and why 
it is useful in quantum mechanics: 


Proposition 4.16.7. As operators, we have the identity

$$ :Q^n: \ = \frac{1}{2^{n/2}} :(A + A^*)^n: \ = \frac{1}{2^{n/2}} \sum_{j=0}^{n} \binom{n}{j} (A^*)^j A^{n-j} . $$

Definition. Define $L = \sum_{j=0}^{n} \binom{n}{j} (A^*)^j A^{n-j}$.

Proof. Since we know that $\Omega_n$ forms a basis in $L^2$, we have only to verify that $:Q^n: \Omega_k = 2^{-n/2} L \Omega_k$ for all $k$. From

$$ \sqrt{2} \, [Q, L] = \Big[ A + A^*, \sum_{j=0}^{n} \binom{n}{j} (A^*)^j A^{n-j} \Big] = \sum_{j=0}^{n} \binom{n}{j} \Big( j (A^*)^{j-1} A^{n-j} - (n-j) (A^*)^j A^{n-j-1} \Big) = 0 $$

we obtain by linearity $[H_k(\sqrt{2} Q), L] = 0$. Because $:Q^n: \Omega_0 = 2^{-n/2} (n!)^{1/2} \Omega_n = 2^{-n/2} (A^*)^n \Omega_0 = 2^{-n/2} L \Omega_0$, we get

$$ 0 = (:Q^n: - 2^{-n/2} L) \Omega_0 $$

$$ = (k!)^{-1/2} H_k(\sqrt{2} Q) (:Q^n: - 2^{-n/2} L) \Omega_0 $$

$$ = (:Q^n: - 2^{-n/2} L) (k!)^{-1/2} H_k(\sqrt{2} Q) \Omega_0 $$

$$ = (:Q^n: - 2^{-n/2} L) \Omega_k . $$

□


Remark. The new ordering makes the operators $A, A^*$ behave as if they commuted, even though they do not: they satisfy the commutation relation $[A, A^*] = 1$.


The fact that stochastic integration is relevant to quantum mechanics can 
be seen from the following formula for the Ito integral: 


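Theorem 4.16.8 (Ito Integral of $B^n$). Wick ordering makes the Ito integral behave like an ordinary integral:

$$ \int_0^t :B_s^n: \, dB_s = \frac{1}{n+1} :B_t^{n+1}: . $$

Here the Wick power of $B_s$ is understood relative to the variance $s$ of $B_s$, that is $:B_s^n: \ = s^{n/2} :(B_s/\sqrt{s})^n:$, which agrees with the definition above for $s = 1$.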


Remark. Notation can be important to make a concept appear natural. Another example where an adaption of notation helps is quantum calculus, "calculus without taking limits" [44], where the derivative is defined as $D_q f(x) = d_q f(x)/d_q(x)$ with $d_q f(x) = f(qx) - f(x)$. One can see that $D_q x^n = [n] x^{n-1}$, where $[n] = \frac{q^n - 1}{q - 1}$. The limit $q \to 1$ corresponds to the classical limit case $\hbar \to 0$ of quantum mechanics.
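The rule $D_q x^n = [n] x^{n-1}$ can be tried out directly. A minimal sketch, with function names chosen here for illustration and the sample values $q = 1.5$, $x = 2$, $n = 3$ picked arbitrarily:

```python
def q_derivative(f, q):
    """Return the q-derivative D_q f, where D_q f(x) = (f(qx) - f(x)) / (qx - x)."""
    def derivative(x):
        return (f(q * x) - f(x)) / (q * x - x)
    return derivative


def q_number(n, q):
    """The q-analogue [n] = (q^n - 1) / (q - 1) of the integer n."""
    return (q ** n - 1.0) / (q - 1.0)


def cube(x):
    return x ** 3


q, x = 1.5, 2.0
print(q_derivative(cube, q)(x), q_number(3, q) * x ** 2)
print(q_number(3, 1.0 + 1e-6))  # close to 3 as q approaches 1
```

Both expressions on the first line give the same value, and $[3]$ tends to $3$ as $q \to 1$, recovering the ordinary derivative $3x^2$.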


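Proof. By rescaling, we can assume that $t = 1$.
We prove all these equalities simultaneously by showing

$$ \int_0^1 :e^{\alpha B_s}: \, dB = \alpha^{-1} :e^{\alpha B_1}: - \alpha^{-1} . $$

The generating function for the Hermite polynomials is known to be

$$ \sum_{n=0}^{\infty} H_n(x) \frac{\alpha^n}{n!} = e^{2 \alpha x - \alpha^2} . $$

(We can check this formula by multiplying it with $\Omega_0$ and replacing $\alpha$ with $\alpha/\sqrt{2}$, so that we have

$$ \sum_{n=0}^{\infty} \frac{\Omega_n(x) \alpha^n}{(n!)^{1/2}} = \Omega_0(x) \, e^{\sqrt{2} \alpha x - \alpha^2/2} . $$

If we apply $A^*$ on both sides, the equation goes onto itself and we get after $k$ such applications of $A^*$ that the inner product with $\Omega_k$ is the same on both sides. Therefore the functions must be the same.)

This means

$$ :e^{\alpha x}: \ = \sum_{j=0}^{\infty} \frac{\alpha^j :x^j:}{j!} = e^{\alpha x - \alpha^2/2} , $$

and for $B_s$, which has variance $s$, this becomes $:e^{\alpha B_s}: \ = e^{\alpha B_s - \alpha^2 s/2}$. Since the function $f(x,s) = e^{\alpha x - \alpha^2 s/2}$ satisfies $\dot{f} + f''/2 = 0$, the claim follows from the Ito formula for such functions. □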


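We can now determine all the integrals $\int B^n \, dB$. The Wick powers are taken relative to the variance of the Brownian motion at the respective time, so that $:B_s^2: \ = B_s^2 - s$ and $:B_t^3: \ = B_t^3 - 3 t B_t$:

$$ \int_0^t 1 \, dB = B_t $$

$$ \int_0^t B_s \, dB = \frac{1}{2} :B_t^2: \ = \frac{1}{2} (B_t^2 - t) $$

$$ \int_0^t B_s^2 \, dB = \int_0^t (:B_s^2: + s) \, dB = \frac{1}{3} :B_t^3: + \int_0^t s \, dB_s = \frac{1}{3} (B_t^3 - 3 t B_t) + \int_0^t s \, dB_s . $$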


Stochastic integrals for the oscillator and the Brownian bridge process.
Let $Q_t = e^{-t} B_{e^{2t}}/\sqrt{2}$ be the oscillator process and $A_t = (1 - t) B_{t/(1-t)}$ the Brownian bridge. If we define new discrete differentials

$$ \delta_n Q_{t_k} = Q_{t_{k+1}} - e^{-(t_{k+1} - t_k)} Q_{t_k} , $$

$$ \delta_n A_{t_k} = A_{t_{k+1}} - A_{t_k} + \frac{t_{k+1} - t_k}{1 - t_k} A_{t_k} , $$

the stochastic integrals can be defined as in the case of Brownian motion as a limit of discrete integrals.
Feynman-Kac formula for Schrödinger operators with magnetic fields.
Stochastic integrals appear in the Feynman-Kac formula for particles moving in a magnetic field. Let $A(x)$ be a vector potential in $\mathbb{R}^3$ which gives the magnetic field $B(x) = \mathrm{curl}(A)$. Quantum mechanically, a particle moving in a magnetic field together with an external field is described by the Hamiltonian

$$ H = (i \nabla + A)^2 + V . $$

In the case $A = 0$, we get the usual Schrödinger operator. The Feynman-Kac formula is the Wiener integral

$$ e^{-tH} u(0) = \int e^{-F(B,t)} u(B_t) \, dB , $$

where $F(B,t)$ is a stochastic integral:

$$ F(B,t) = i \int_0^t A(B_s) \cdot dB + \frac{i}{2} \int_0^t \mathrm{div}(A) \, ds + \int_0^t V(B_s) \, ds . $$
0 ) 


4.17 Processes of bounded quadratic variation 


We develop now the stochastic Ito integral with respect to general martingales.
Brownian motion B will be replaced by a martingale M which is
assumed to be in $L^2$. The aim will be to define an integral

$$ \int_0^t K_s \, dM_s , $$


where K is a progressively measurable process which satisfies some bound- 
edness condition. 


Definition. Given a right-continuous function $f : [0,\infty) \to R$. For each
finite subdivision

$$ \Delta = \{ 0 = t_0, t_1, \dots, t = t_n \} $$

of the interval [0, t] we define $|\Delta| = \sup_{i} |t_{i+1} - t_i|$, called the modulus of
$\Delta$. Define

$$ \|f\|_\Delta = \sum_{i=0}^{n-1} |f(t_{i+1}) - f(t_i)| . $$

A function with finite total variation $\|f\|_t = \sup_\Delta \|f\|_\Delta < \infty$ for every t is called a
function of finite variation. If $\sup_t \|f\|_t < \infty$, then f is called of bounded
variation. One abbreviates bounded variation with BV.


Example. Differentiable $C^1$ functions are of finite variation. Note that for
functions of finite variation, the variation $V_t = \|f\|_t$ can go to $\infty$ for $t \to \infty$, but if $V_t$ stays
bounded, we have a function of bounded variation. Monotone and bounded
functions are of finite variation. Sums of functions of bounded variation are
of bounded variation.
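The sum defining the variation over a subdivision is easy to evaluate numerically. The following is a minimal sketch, not part of the original notes; the helper names `variation` and `uniform_subdivision` are made up for this illustration. It uses the sine function on $[0, 2\pi]$, whose total variation is 4 because it rises by 1, falls by 2 and rises by 1 again.

```python
import math

def variation(f, points):
    """Sum of |f(t_{i+1}) - f(t_i)| over consecutive points of a subdivision."""
    return sum(abs(f(b) - f(a)) for a, b in zip(points, points[1:]))

def uniform_subdivision(t, n):
    """The subdivision 0 = t_0 < t_1 < ... < t_n = t with equal spacing."""
    return [t * i / n for i in range(n + 1)]

# A coarse subdivision that already contains the turning points of sin,
# and a much finer one that refines it.
coarse = variation(math.sin, uniform_subdivision(2 * math.pi, 4))
fine = variation(math.sin, uniform_subdivision(2 * math.pi, 4000))
print(coarse, fine)
```

Refining a subdivision can only increase the sum, so for a piecewise monotone function the supremum is reached as soon as the turning points belong to the subdivision.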


Remark. Every function of finite variation can be written as $f = f^+ - f^-$,
where $f^\pm$ are both positive and increasing. Proof: define $f^\pm_t = (\pm f_t + \|f\|_t)/2$.

Remark. Functions of bounded variation are in one to one correspondence
to Borel measures on $[0,\infty)$ by the Stieltjes integral $\int_0^t |df| = f_t^+ + f_t^-$.


Definition. A process $X_t$ is called increasing if the paths $X_t(\omega)$ are finite,
right-continuous and increasing for almost all $\omega \in \Omega$. A process $X_t$ is called
of finite variation, if the paths $X_t(\omega)$ are finite, right-continuous and of
finite variation for almost all $\omega \in \Omega$.


Remark. Every bounded variation process A can be written as $A_t = A_t^+ - A_t^-$,
where $A^\pm$ are increasing. The process $V_t = \int_0^t |dA|_s = A_t^+ + A_t^-$ is
increasing and we get for almost all $\omega \in \Omega$ a measure called the variation
of A.


If $X_t$ is a bounded $\mathcal{A}_t$-adapted process and A is a process of bounded
variation, we can form the Lebesgue-Stieltjes integral

$$ (X \cdot A)_t(\omega) = \int_0^t X_s(\omega) \, dA_s(\omega) . $$

We would like to define such an integral for martingales. The problem is:


Proposition 4.17.1. A continuous martingale M is never of finite variation, 
unless it is constant. 


Proof. Assume M is of finite variation. We show that it is constant.

(i) We can assume without loss of generality that M is of bounded variation.

Proof. Otherwise, we can look at the martingale $M^{S_n}$, where $S_n$ is the
stopping time $S_n = \inf\{ s \mid V_s > n \}$ and $V_t$ is the variation of M on [0, t].


(ii) We can also assume without loss of generality that $M_0 = 0$.


(iii) Let $\Delta = \{ t_0 = 0, t_1, \dots, t_n = t \}$ be a subdivision of [0, t]. Since M is a
martingale, we have by Pythagoras

$$ E[M_t^2] = E[\sum_{i=0}^{n-1} (M_{t_{i+1}}^2 - M_{t_i}^2)] $$
$$ = E[\sum_{i=0}^{n-1} (M_{t_{i+1}} - M_{t_i})(M_{t_{i+1}} + M_{t_i})] $$
$$ = E[\sum_{i=0}^{n-1} (M_{t_{i+1}} - M_{t_i})^2] $$

and so, if K is a bound for the variation $V_t$,

$$ E[M_t^2] \le E[V_t \cdot \sup_i |M_{t_{i+1}} - M_{t_i}|] \le K \cdot E[\sup_i |M_{t_{i+1}} - M_{t_i}|] . $$

If the modulus $|\Delta|$ goes to zero, then the right hand side goes to zero since
M is continuous. Therefore M = 0. □


Remark. This proposition applies especially for Brownian motion and un- 
derlines the fact that the stochastic integral could not be defined point wise 
by a Lebesgue-Stieltjes integral. 


Definition. If $\Delta = \{ t_0 = 0 < t_1 < \dots \}$ is a subdivision of $R^+ = [0,\infty)$ with
only finitely many points $\{ t_0, t_1, \dots, t_k \}$ in each interval [0, t], we define
for a process X

$$ T_t^\Delta = T_t^\Delta(X) = \Big( \sum_{i=0}^{k-1} (X_{t_{i+1}} - X_{t_i})^2 \Big) + (X_t - X_{t_k})^2 . $$

The process X is called of finite quadratic variation, if there exists a process
$\langle X,X \rangle$ such that for each t, the random variable $T_t^\Delta$ converges in
probability to $\langle X,X \rangle_t$ as $|\Delta| \to 0$.
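The contrast between variation and quadratic variation can be seen in a simulation. The sketch below is an illustration under stated assumptions, not part of the original text: it approximates a Brownian path on [0, 1] by independent Gaussian increments (the helper names and the seed are arbitrary choices) and evaluates both sums over the same uniform subdivision. The sum of squared increments stays near t = 1, while the sum of absolute increments grows with the number of subdivision points.

```python
import math
import random

def brownian_path(t, n, rng):
    """Values B_0, B_{t/n}, ..., B_t built from independent N(0, t/n) increments."""
    path = [0.0]
    step_sd = math.sqrt(t / n)
    for _ in range(n):
        path.append(path[-1] + rng.gauss(0.0, step_sd))
    return path

def quadratic_variation(path):
    """Sum of squared increments over the subdivision carried by the path."""
    return sum((b - a) ** 2 for a, b in zip(path, path[1:]))

def total_variation(path):
    """Sum of absolute increments over the same subdivision."""
    return sum(abs(b - a) for a, b in zip(path, path[1:]))

rng = random.Random(0)            # fixed seed, chosen only for reproducibility
path = brownian_path(1.0, 20000, rng)
qv = quadratic_variation(path)    # should be close to t = 1
tv = total_variation(path)        # of order sqrt(n), unbounded as n grows
print(qv, tv)
```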


Theorem 4.17.2 (Doob-Meyer decomposition). Given a continuous and
bounded martingale M of finite quadratic variation. Then $\langle M,M \rangle$ is
the unique continuous increasing adapted process vanishing at zero such
that $M^2 - \langle M,M \rangle$ is a martingale.


Remark. Before we enter the not so easy proof given in [83], let us mention
the corresponding result in the discrete case (see theorem (3.5.1)), where
$M^2$ was a submartingale so that $M^2$ could be written uniquely as a sum
of a martingale and an increasing previsible process.


Proof. Uniqueness follows from the previous proposition: if there were
two such continuous and increasing processes A, B, then A − B would be
a continuous martingale of bounded variation (if A and B are increasing,
they are of bounded variation) which vanishes at zero. Therefore A = B.


(i) $M_t^2 - T_t^\Delta(M)$ is a continuous martingale.
Proof. For $t_i < s < t_{i+1}$, we have from the martingale property, using that
$M_{t_{i+1}} - M_s$ and $M_s - M_{t_i}$ are orthogonal,

$$ E[(M_{t_{i+1}} - M_{t_i})^2 \mid \mathcal{A}_s] = E[(M_{t_{i+1}} - M_s)^2 \mid \mathcal{A}_s] + (M_s - M_{t_i})^2 . $$

This implies with $0 = t_0 < t_1 < \dots < t_l < s < t_{l+1} < \dots < t_k < t$ and
using orthogonality

$$ E[T_t^\Delta(M) - T_s^\Delta(M) \mid \mathcal{A}_s] = E[\sum_{j=l+1}^{k-1} (M_{t_{j+1}} - M_{t_j})^2 \mid \mathcal{A}_s] $$
$$ + E[(M_t - M_{t_k})^2 \mid \mathcal{A}_s] + E[(M_{t_{l+1}} - M_s)^2 \mid \mathcal{A}_s] $$
$$ = E[(M_t - M_s)^2 \mid \mathcal{A}_s] = E[M_t^2 - M_s^2 \mid \mathcal{A}_s] . $$

This implies that $M_t^2 - T_t^\Delta(M)$ is a continuous martingale.


(ii) Let C be a constant such that $|M| \le C$ in [0, a]. Then $E[T_a^\Delta] \le 4C^2$,
independent of the subdivision $\Delta = \{ t_0, \dots, t_n \}$ of [0, a].
Proof. The previous computation in (i) gives for s = 0, using $T_0^\Delta(M) = 0$,

$$ E[T_t^\Delta(M) \mid \mathcal{A}_0] = E[M_t^2 - M_0^2 \mid \mathcal{A}_0] = E[(M_t - M_0)(M_t + M_0) \mid \mathcal{A}_0] \le 4C^2 . $$


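(iii) For any subdivision $\Delta$, one has $E[(T_a^\Delta)^2] \le 48 C^4$.
Proof. We can assume $t_n = a$. Then

$$ (T_a^\Delta(M))^2 = \Big( \sum_{k=1}^{n} (M_{t_k} - M_{t_{k-1}})^2 \Big)^2 $$
$$ = 2 \sum_{k=1}^{n} (T_a^\Delta - T_{t_k}^\Delta)(T_{t_k}^\Delta - T_{t_{k-1}}^\Delta) + \sum_{k=1}^{n} (M_{t_k} - M_{t_{k-1}})^4 . $$

From (i), we have

$$ E[T_a^\Delta - T_{t_k}^\Delta \mid \mathcal{A}_{t_k}] = E[(M_a - M_{t_k})^2 \mid \mathcal{A}_{t_k}] $$

and consequently, using (ii),

$$ E[(T_a^\Delta)^2] = 2 \sum_{k=1}^{n} E[(M_a - M_{t_k})^2 (T_{t_k}^\Delta - T_{t_{k-1}}^\Delta)] + \sum_{k=1}^{n} E[(M_{t_k} - M_{t_{k-1}})^4] $$
$$ \le E[(2 \sup_k |M_a - M_{t_k}|^2 + \sup_k |M_{t_k} - M_{t_{k-1}}|^2) \, T_a^\Delta] $$
$$ \le 12 C^2 E[T_a^\Delta] \le 48 C^4 . $$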


(iii) For fixed a > 0 and subdivisions $\Delta_n$ of [0, a] satisfying $|\Delta_n| \to 0$, the
sequence $T_a^{\Delta_n}$ has a limit in $L^2$.

Proof. Given two subdivisions $\Delta', \Delta''$ of [0, a], let $\Delta$ be the subdivision
obtained by taking the union of the points of $\Delta'$ and $\Delta''$. By (i), the process
$X = T^{\Delta'} - T^{\Delta''}$ is a martingale and by (i) again, applied to the martingale
X instead of M, we have, using $(x+y)^2 \le 2(x^2 + y^2)$,

$$ E[X_a^2] = E[(T_a^{\Delta'} - T_a^{\Delta''})^2] = E[T_a^\Delta(X)] \le 2 ( E[T_a^\Delta(T^{\Delta'})] + E[T_a^\Delta(T^{\Delta''})] ) . $$

We have therefore only to show that $E[T_a^\Delta(T^{\Delta'})] \to 0$ for $|\Delta'| + |\Delta''| \to 0$.
Let $s_k$ be in $\Delta$ and $t_m$ the rightmost point in $\Delta'$ such that $t_m \le s_k < s_{k+1} \le t_{m+1}$. We have

$$ T_{s_{k+1}}^{\Delta'} - T_{s_k}^{\Delta'} = (M_{s_{k+1}} - M_{t_m})^2 - (M_{s_k} - M_{t_m})^2 $$
$$ = (M_{s_{k+1}} - M_{s_k})(M_{s_{k+1}} + M_{s_k} - 2 M_{t_m}) $$

and so

$$ T_a^\Delta(T^{\Delta'}) \le ( \sup_k |M_{s_{k+1}} + M_{s_k} - 2 M_{t_m}|^2 ) \, T_a^\Delta . $$

By the Cauchy-Schwarz inequality

$$ E[T_a^\Delta(T^{\Delta'})] \le E[\sup_k |M_{s_{k+1}} + M_{s_k} - 2 M_{t_m}|^4]^{1/2} \, E[(T_a^\Delta)^2]^{1/2} $$

and the first factor goes to 0 as $|\Delta| \to 0$ and the second factor is bounded
because of the estimate $E[(T_a^\Delta)^2] \le 48 C^4$.


(iv) There exists a sequence $\Delta_n \subset \Delta_{n+1}$ such that $T_t^{\Delta_n}(M)$ converges
uniformly to a limit $\langle M,M \rangle$ on [0, a].
Proof. Doob's inequality applied to the martingale $T^{\Delta_n} - T^{\Delta_m}$
gives

$$ E[\sup_{t \le a} |T_t^{\Delta_n} - T_t^{\Delta_m}|^2] \le 4 E[(T_a^{\Delta_n} - T_a^{\Delta_m})^2] . $$

Choosing the sequence $\Delta_n$ such that $\Delta_{n+1}$ is a refinement of $\Delta_n$ and such
that $\bigcup_n \Delta_n$ is dense in [0, a], we can achieve that the convergence is uniform.
The limit $\langle M,M \rangle$ is therefore continuous.


(v) $\langle M,M \rangle$ is increasing.

Proof. Take $\Delta_n \subset \Delta_{n+1}$. For any pair s < t in $\bigcup_n \Delta_n$, we have $T_s^{\Delta_n}(M) \le T_t^{\Delta_n}(M)$
if n is so large that $\Delta_n$ contains both s and t. Therefore $\langle M,M \rangle$
is increasing on $\bigcup_n \Delta_n$, which can be chosen to be dense. The continuity
of $\langle M,M \rangle$ implies that $\langle M,M \rangle$ is increasing everywhere. □


Remark. The assumption of boundedness for the martingales is not essential.
The result holds for general martingales and even more generally for so called
local martingales, stochastic processes X for which there exists a sequence
of bounded stopping times $T_n$ increasing to $\infty$ for which $X^{T_n}$ are martingales.


Corollary 4.17.3. Let M, N be two continuous martingales with the same
filtration. There exists a unique continuous adapted process $\langle M,N \rangle$ of finite
variation which is vanishing at zero and such that

$$ MN - \langle M,N \rangle $$

is a martingale.


Proof. Uniqueness follows again from the fact that a finite variation martingale
must be zero. To get existence, use the polarization identity

$$ \langle M,N \rangle = \frac{1}{4} ( \langle M+N,M+N \rangle - \langle M-N,M-N \rangle ) . $$

This is vanishing at zero and of finite variation since it is a difference of two
processes with this property.

We know that $M^2 - \langle M,M \rangle$, $N^2 - \langle N,N \rangle$ and so also $(M \pm N)^2 - \langle M \pm N, M \pm N \rangle$
are martingales. Therefore

$$ (M+N)^2 - \langle M+N,M+N \rangle - ( (M-N)^2 - \langle M-N,M-N \rangle ) $$
$$ = 4MN - \langle M+N,M+N \rangle + \langle M-N,M-N \rangle = 4 ( MN - \langle M,N \rangle ) $$

is a martingale and so $MN - \langle M,N \rangle$ is a martingale. □


Definition. The process $\langle M,N \rangle$ is called the bracket of M and N and
$\langle M,M \rangle$ the increasing process of M.


Example. If $B = (B^{(1)}, \dots, B^{(d)})$ is Brownian motion, then $\langle B^{(i)}, B^{(j)} \rangle_t = \delta_{ij} t$
as we have computed in the proof of the Ito formula in the case t = 1.
It can be shown that every continuous martingale M which has the property that

$$ \langle M^{(i)}, M^{(j)} \rangle_t = \delta_{ij} t $$

must be Brownian motion. This is Lévy's characterization of Brownian
motion.


Remark. If M is a martingale vanishing at zero and $\langle M,M \rangle = 0$, then
M = 0. Since $M_t^2 - \langle M,M \rangle_t$ is a martingale vanishing at zero, we have
$E[M_t^2] = E[\langle M,M \rangle_t]$.


Remark. Since we have got (M,M) as a limit of processes TA, we could 
also write (M,N) as such a limit. 
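On the level of discrete sums, the polarization identity used for the bracket holds exactly and not only in the limit. The following sketch is an added illustration with hypothetical helper names: it builds the increments of two independent simulated Brownian paths, computes the discrete bracket directly as a sum of products of increments, and compares it with the polarized expression. For independent Brownian motions the discrete bracket is close to 0, and the discrete bracket of a path with itself is close to t.

```python
import math
import random

def increments(n, t, rng):
    """n independent N(0, t/n) increments of a simulated Brownian path on [0, t]."""
    sd = math.sqrt(t / n)
    return [rng.gauss(0.0, sd) for _ in range(n)]

def discrete_bracket(dm, dn):
    """Sum of products of increments over one subdivision."""
    return sum(a * b for a, b in zip(dm, dn))

rng = random.Random(3)                 # arbitrary seed for reproducibility
dM = increments(20000, 1.0, rng)
dN = increments(20000, 1.0, rng)       # independent of dM

direct = discrete_bracket(dM, dN)
plus = [a + b for a, b in zip(dM, dN)]
minus = [a - b for a, b in zip(dM, dN)]
polarized = (discrete_bracket(plus, plus) - discrete_bracket(minus, minus)) / 4
print(direct, polarized, discrete_bracket(dM, dM))
```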


4.18 The Ito integral for martingales 


In the last section, we have defined for two continuous martingales M,N, 
the bracket process (M,N). Because (M, M) was increasing, it was of fi- 
nite variation and therefore also (M,N) is of finite variation. It defines a 
random measure d(M, N). 


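Theorem 4.18.1 (Kunita-Watanabe inequality). Let M, N be two continuous
martingales and H, K two measurable processes. Then for all $p, q \ge 1$
satisfying 1/p + 1/q = 1, we have for all $t \le \infty$

$$ E[\int_0^t |H_s| |K_s| \, |d\langle M,N \rangle_s|] \le \| ( \int_0^t H_s^2 \, d\langle M,M \rangle_s )^{1/2} \|_p \cdot \| ( \int_0^t K_s^2 \, d\langle N,N \rangle_s )^{1/2} \|_q . $$

Proof. (i) Define $\langle M,N \rangle_s^t = \langle M,N \rangle_t - \langle M,N \rangle_s$. Claim: almost surely

$$ |\langle M,N \rangle_s^t| \le ( \langle M,M \rangle_s^t )^{1/2} ( \langle N,N \rangle_s^t )^{1/2} . $$

Proof. For fixed r, the random variable

$$ \langle M,M \rangle_s^t + 2r \langle M,N \rangle_s^t + r^2 \langle N,N \rangle_s^t = \langle M + rN, M + rN \rangle_s^t $$

is nonnegative almost everywhere and this stays true simultaneously for a dense
set of $r \in R$. Since M, N are continuous, it holds for all r. The claim follows,
since $a + 2rb + cr^2 \ge 0$ for all real r with nonnegative a, c implies $|b| \le \sqrt{a} \sqrt{c}$.


(ii) To prove the claim, it is, using Hölder's inequality, enough to show
that almost everywhere, the inequality

$$ \int_0^t |H_s| |K_s| \, d|\langle M,N \rangle|_s \le ( \int_0^t H_s^2 \, d\langle M,M \rangle )^{1/2} \cdot ( \int_0^t K_s^2 \, d\langle N,N \rangle )^{1/2} $$

holds. By taking limits, it is enough to prove this for $t < \infty$ and bounded
K, H. By a density argument, we can also assume that both K and H are
step functions $H = \sum_{i=1}^n H_i 1_{J_i}$ and $K = \sum_{i=1}^n K_i 1_{J_i}$, where $J_i = [t_i, t_{i+1})$.


(iii) We get from (i) for step functions H, K as in (ii)

$$ | \int_0^t H_s K_s \, d\langle M,N \rangle_s | \le \sum_i |H_i K_i| \, |\langle M,N \rangle_{t_i}^{t_{i+1}}| $$
$$ \le \sum_i |H_i K_i| ( \langle M,M \rangle_{t_i}^{t_{i+1}} )^{1/2} ( \langle N,N \rangle_{t_i}^{t_{i+1}} )^{1/2} $$
$$ \le ( \sum_i H_i^2 \langle M,M \rangle_{t_i}^{t_{i+1}} )^{1/2} ( \sum_i K_i^2 \langle N,N \rangle_{t_i}^{t_{i+1}} )^{1/2} $$
$$ = ( \int_0^t H_s^2 \, d\langle M,M \rangle )^{1/2} \cdot ( \int_0^t K_s^2 \, d\langle N,N \rangle )^{1/2} , $$

where we have used the Cauchy-Schwarz inequality for the summation over
i. □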
Definition. Denote by $\mathbb{H}^2$ the set of $L^2$-martingales which are $\mathcal{A}_t$-adapted
and satisfy

$$ \|M\|_{\mathbb{H}^2} = ( \sup_t E[M_t^2] )^{1/2} < \infty . $$

Call $H^2$ the subset of continuous martingales in $\mathbb{H}^2$ and $H_0^2$ the subset
of continuous martingales which are vanishing at zero.


Given a martingale $M \in H^2$, we define $L^2(M)$ as the space of progressively
measurable processes K such that

$$ \|K\|_{L^2(M)}^2 = E[\int_0^\infty K_s^2 \, d\langle M,M \rangle_s] < \infty . $$

Both $H^2$ and $L^2(M)$ are Hilbert spaces.

Lemma 4.18.2. The space $H^2$ of continuous $L^2$ martingales is closed in $\mathbb{H}^2$
and so a Hilbert space. Also $H_0^2$ is closed in $H^2$ and is therefore a Hilbert
space.


Proof. Take a sequence $M^{(n)}$ in $H^2$ converging to $M \in \mathbb{H}^2$. By Doob's
inequality

$$ E[( \sup_t |M_t^{(n)} - M_t| )^2] \le 4 \| M^{(n)} - M \|_{\mathbb{H}^2}^2 . $$

We can extract a subsequence, for which $\sup_t |M_t^{(n_k)} - M_t|$ converges pointwise
to zero almost everywhere. Therefore $M \in H^2$. The same argument
shows also that $H_0^2$ is closed. □


Proposition 4.18.3. Given $M \in H^2$ and $K \in L^2(M)$. There exists a unique
element $\int_0^t K \, dM \in H_0^2$ such that

$$ \langle \int_0^t K \, dM, N \rangle = \int_0^t K \, d\langle M,N \rangle $$

for every $N \in H^2$. The map $K \mapsto \int_0^t K \, dM$ is an isometry from $L^2(M)$ to
$H_0^2$.


Proof. We can assume $M \in H_0^2$ since in general, we define $\int_0^t K \, dM = \int_0^t K \, d(M - M_0)$.


(i) By the Kunita-Watanabe inequality, we have for every $N \in H_0^2$

$$ | E[\int_0^t K_s \, d\langle M,N \rangle_s] | \le \|N\|_{H^2} \cdot \|K\|_{L^2(M)} . $$

The map

$$ N \mapsto E[\int_0^t K_s \, d\langle M,N \rangle_s] $$

is therefore a linear continuous functional on the Hilbert space $H_0^2$. By the
Riesz representation theorem, there is an element $\int K \, dM \in H_0^2$ such that

$$ E[(\int_0^t K_s \, dM_s) N_t] = E[\int_0^t K_s \, d\langle M,N \rangle_s] $$

for every $N \in H_0^2$.

(ii) Uniqueness. Assume there exist two martingales $L, L' \in H_0^2$ such that
$\langle L,N \rangle = \langle L',N \rangle$ for all $N \in H_0^2$. Then, in particular, $\langle L - L', L - L' \rangle = 0$,
from which L = L' follows.


(iii) The integral $K \mapsto \int_0^t K \, dM$ is an isometry because

$$ \| \int_0^t K \, dM \|_{H_0^2}^2 = E[( \int_0^\infty K_s \, dM_s )^2] = E[\int_0^\infty K_s^2 \, d\langle M,M \rangle] = \|K\|_{L^2(M)}^2 . $$
□


Definition. The martingale $\int_0^t K_s \, dM_s$ is called the Ito integral of the
progressively measurable process K with respect to the martingale M. We
can take especially K = f(M), since continuous processes are progressively
measurable. If we take M = B, Brownian motion, we get the already
familiar Ito integral.


Definition. An $\mathcal{A}_t$-adapted right-continuous process X is called a local martingale
if there exists a sequence $T_n$ of increasing stopping times with $T_n \to \infty$
almost everywhere, such that for every n, the process $X^{T_n} 1_{\{T_n > 0\}}$ is a uniformly
integrable $\mathcal{A}_t$-martingale. Local martingales are more general than
martingales. Stochastic integration can be defined more generally for local
martingales.


We show now that Ito’s formula holds also for general martingales. First, 
a special case, the integration by parts formula. 


Theorem 4.18.4 (Integration by parts). Let X, Y be two continuous martingales.
Then

$$ X_t Y_t - X_0 Y_0 = \int_0^t X_s \, dY_s + \int_0^t Y_s \, dX_s + \langle X,Y \rangle_t $$

and especially

$$ X_t^2 - X_0^2 = 2 \int_0^t X_s \, dX_s + \langle X,X \rangle_t . $$


Proof. The general case follows from the special case by polarization: use
the special case for $X \pm Y$ as well as X and Y.

The special case is proved by discretisation: let $\Delta = \{ t_0, t_1, \dots, t_n \}$ be a
finite discretisation of [0, t]. Then

$$ \sum_{i=1}^{n} (X_{t_i} - X_{t_{i-1}})^2 = X_t^2 - X_0^2 - 2 \sum_{i=1}^{n} X_{t_{i-1}} (X_{t_i} - X_{t_{i-1}}) . $$

Letting $|\Delta|$ go to zero, we get the claim. □
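The discrete identity behind the integration by parts formula is purely algebraic and holds for any finite sequence of values. The sketch below is an added illustration with made-up helper names; it checks the identity on a simulated random walk path, where the left-endpoint sums play the role of the Ito integral $\int X \, dX$ and the sum of squared increments plays the role of $\langle X,X \rangle$.

```python
import random

def left_point_sum(path):
    """Discrete Ito sum: sum of X_{t_{i-1}} (X_{t_i} - X_{t_{i-1}})."""
    return sum(a * (b - a) for a, b in zip(path, path[1:]))

def squared_increments(path):
    """Discrete quadratic variation: sum of (X_{t_i} - X_{t_{i-1}})^2."""
    return sum((b - a) ** 2 for a, b in zip(path, path[1:]))

rng = random.Random(1)              # arbitrary seed
path = [0.0]
for _ in range(1000):
    path.append(path[-1] + rng.gauss(0.0, 0.03))

lhs = squared_increments(path)
rhs = path[-1] ** 2 - path[0] ** 2 - 2 * left_point_sum(path)
print(lhs, rhs)                     # equal up to rounding errors
```

Evaluating the integrand at the left endpoint is essential: with right endpoints or midpoints the same algebra produces a different correction term.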


Theorem 4.18.5 (Ito formula for martingales). Given vector martingales
$M = (M^{(1)}, \dots, M^{(d)})$ and X and a function $f \in C^2(R^d, R)$. Then

$$ f(X_t) - f(X_0) = \int_0^t \nabla f(X_s) \, dM_s + \frac{1}{2} \sum_{i,j} \int_0^t \delta_{x_i} \delta_{x_j} f(X_s) \, d\langle M^{(i)}, M^{(j)} \rangle_s . $$


Proof. It is enough to prove the formula for polynomials. By the integration
by parts formula, we get the result for functions $f(x) = x_i g(x)$, if it is
established for a function g. Since it is true for constant functions, we are
done by induction. □


Remark. In one dimension and if $M_t = B_t$ is Brownian motion, the usual
Ito formula is a special case:

$$ f(X_t) - f(X_0) = \int_0^t f'(X_s) \, dB_s + \frac{1}{2} \int_0^t f''(X_s) \, ds . $$

We will use it later, when dealing with stochastic differential
equations. It is a special case, because $\langle B,B \rangle_t = t$, so that $d\langle B,B \rangle_t = dt$.


Example. If $f(x) = x^2$, this formula gives for processes satisfying $X_0 = 0$

$$ X_t^2/2 = \int_0^t X_s \, dB_s + \frac{1}{2} t . $$

This formula integrates the stochastic integral $\int_0^t X_s \, dB_s = X_t^2/2 - t/2$.


Example. If f(x) = log(x), the formula gives

$$ \log(X_t/X_0) = \int_0^t dB_s/X_s - \frac{1}{2} \int_0^t ds/X_s^2 . $$


4.19 Stochastic differential equations 


We have seen earlier that if $B_t$ is Brownian motion, then $X = f(B,t) = e^{\alpha B_t - \alpha^2 t/2}$
is a martingale. In the last section we learned, using Ito's formula
and $\frac{1}{2} \Delta f + \dot{f} = 0$, that with $M_t = B_t$

$$ \int_0^t \alpha X_s \, dM_s = X_t - 1 . $$

We can write this in differential form as

$$ dX_t = \alpha X_t \, dM_t , \quad X_0 = 1 . $$

This is an example of a stochastic differential equation (SDE) and one
would use the notation

$$ \frac{dX}{dM} = \alpha X $$

if it would not lead to confusion with the corresponding ordinary differential
equation, where M is not a stochastic process but a variable and where the
solution would be $X = e^{\alpha M}$. Here, the solution is the stochastic process
$X_t = e^{\alpha B_t - \alpha^2 t/2}$.

Definition. Let $B_t$ be Brownian motion in $R^d$. A solution of a stochastic
differential equation

$$ dX_t = f(X_t, B_t) \cdot dB_t + g(X_t) \, dt $$

is a $R^d$-valued process $X_t$ satisfying

$$ X_t = \int_0^t f(X_s, B_s) \cdot dB_s + \int_0^t g(X_s) \, ds , $$

where $f : R^d \times R^d \to R^d$ and $g : R^d \times R^+ \to R^d$.


As for ordinary differential equations, where one can easily solve separable
differential equations dx/dt = f(x) g(t) by integration, this works for
stochastic differential equations. However, to integrate, one has to use an
adapted substitution. The key is Ito's formula (4.18.5) which holds for
martingales and so for solutions of stochastic differential equations. In
one dimension it reads

$$ f(X_t) - f(X_0) = \int_0^t f'(X_s) \, dX_s + \frac{1}{2} \int_0^t f''(X_s) \, d\langle X_s, X_s \rangle . $$

The following "multiplication table" for the product $\langle \cdot, \cdot \rangle$ and the differentials
$dt$, $dB_t$ can be found in many books on stochastic differential equations
[2, 46, 66] and is useful to have in mind when solving actual stochastic
differential equations:

            dt      dB_t
    dt      0       0
    dB_t    0       dt


Example. The linear ordinary differential equation dX/dt = rX with solution
$X_t = e^{rt} X_0$ has a stochastic analog. It is called the stochastic population
model. We look for a stochastic process $X_t$ which solves the SDE

$$ \frac{dX_t}{dt} = r X_t + \alpha X_t \zeta_t . $$

Separation of variables gives

$$ \frac{dX}{X} = r \, dt + \alpha \zeta \, dt $$

and integration with respect to t

$$ \int_0^t \frac{dX_s}{X_s} = rt + \alpha B_t . $$

In order to compute the stochastic integral on the left hand side, we have to
do a change of variables with f(X) = log(X). Looking up the multiplication
table:

$$ \langle dX_t, dX_t \rangle = \langle r X_t dt + \alpha X_t dB_t, \; r X_t dt + \alpha X_t dB_t \rangle = \alpha^2 X_t^2 \, dt . $$

Ito's formula in one dimension

$$ f(X_t) - f(X_0) = \int_0^t f'(X_s) \, dX_s + \frac{1}{2} \int_0^t f''(X_s) \, d\langle X_s, X_s \rangle $$

gives therefore

$$ \log(X_t/X_0) = \int_0^t dX_s/X_s - \frac{1}{2} \int_0^t \alpha^2 \, ds $$

so that $\int_0^t dX_s/X_s = \alpha^2 t/2 + \log(X_t/X_0)$. Therefore,

$$ \alpha^2 t/2 + \log(X_t/X_0) = rt + \alpha B_t $$

and so $X_t = X_0 e^{rt - \alpha^2 t/2 + \alpha B_t}$. This process is called geometric Brownian
motion. We see especially that $\dot{X} = X/2 + X \zeta$ with $X_0 = 1$ has the solution $X_t = e^{B_t}$.

Figure. Solutions to the stochastic population model for r > 0.

Figure. Solutions to the stochastic population model for r < 0.


Remark. The stochastic population model is also important when modeling
financial markets. In that area the constant r is called the percentage drift
or expected gain and $\alpha$ is called the percentage volatility. The Black-Scholes
model makes the assumption that stock prices evolve according to
geometric Brownian motion.

Example. In principle, one can study stochastic versions of any differential
equation. An example from physics is when a particle moves in a possibly
time-dependent force field F(x, t) with friction b for which the equation
without noise is

$$ \ddot{x} = -b \dot{x} + F(x,t) . $$

If we add white noise, we get a stochastic differential equation

$$ \ddot{x} = -b \dot{x} + F(x,t) + \alpha \zeta(t) . $$

For example, with F = 0, the velocity $X_t = \dot{x}(t)$ satisfies the stochastic
differential equation

$$ \frac{dX_t}{dt} = -b X_t + \alpha \zeta_t , $$

which has the solution

$$ X_t = e^{-bt} X_0 + \alpha \int_0^t e^{-b(t-s)} \, dB_s . $$

With a time dependent force F(x, t), already the differential equation without
noise can not be given closed solutions in general. If the friction constant
b is noisy, we obtain

$$ \frac{dX_t}{dt} = (-b + \alpha \zeta_t) X_t $$

which is the stochastic population model treated in the previous example.


Example. Here is a list of stochastic differential equations with solutions.
We again use the notation of white noise $\zeta(t) = dB/dt$ which is a generalized
function in the following table. The notational replacement $dB_t = \zeta_t \, dt$ is
quite popular for more applied sciences like engineering or finance.


Stochastic differential equation                      Solution
$\frac{d}{dt} X_t = B_t \zeta(t)$                     $X_t = :B_t^2:/2 = (B_t^2 - t)/2$
$\frac{d}{dt} X_t = :B_t^2: \zeta(t)$                 $X_t = :B_t^3:/3 = (B_t^3 - 3tB_t)/3$
$\frac{d}{dt} X_t = :B_t^3: \zeta(t)$                 $X_t = :B_t^4:/4 = (B_t^4 - 6tB_t^2 + 3t^2)/4$
$\frac{d}{dt} X_t = \alpha X_t \zeta(t)$              $X_t = e^{\alpha B_t - \alpha^2 t/2}$
$\frac{d}{dt} X_t = r X_t + \alpha X_t \zeta(t)$      $X_t = e^{rt + \alpha B_t - \alpha^2 t/2}$


Remark. Because the Ito integral can be defined for any continuous martingale,
Brownian motion could be replaced by another continuous martingale
M, leading to other classes of stochastic differential equations. A solution
must then satisfy

$$ X_t = \int_0^t f(X_s, M_s, s) \cdot dM_s + \int_0^t g(X_s, s) \, ds . $$


Example.

$$ X_t = e^{\alpha M_t - \alpha^2 \langle M,M \rangle_t / 2} $$

is a solution of $dX_t = \alpha X_t \, dM_t$, $X_0 = 1$.

Remark. Stochastic differential equations were introduced by Ito in 1951.
Differential equations with a different integral came from Stratonovich, but
there are formulas which relate them with each other. So, it is enough
to consider the Ito integral. Both versions of stochastic integration have
advantages and disadvantages. Kunita shows in his book [55] that one can
view solutions as stochastic flows of diffeomorphisms. This brings the topic
into the framework of ergodic theory.


For ordinary differential equations $\dot{x} = f(x,t)$, one knows that unique solutions
exist locally if f is Lipschitz continuous in x and continuous in t. The
proof given for 1-dimensional systems generalizes to differential equations
in arbitrary Banach spaces. The idea of the proof is a Picard iteration of
an operator which is a contraction. Below, we give a detailed proof of this
existence theorem for ordinary differential equations. For stochastic differential
equations, one can do the same. We will do such an iteration on the
Hilbert space $H^2_{[0,T]}$ of $L^2$ martingales X having finite norm

$$ \|X\|_T^2 = E[\sup_{t \le T} X_t^2] . $$


We will need the following version of Doob’s inequality: 


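Lemma 4.19.1. Let X be a $L^p$ martingale with p > 1. Then

$$ E[\sup_{s \le t} |X_s|^p] \le ( \frac{p}{p-1} )^p \cdot E[|X_t|^p] . $$


Proof. We can assume without loss of generality that X is bounded. The
general result follows by approximating X by $X \wedge k$ with $k \to \infty$.
Define $X^* = \sup_{s \le t} |X_s|$. From Doob's inequality

$$ \lambda \cdot P[X^* \ge \lambda] \le E[|X_t| \cdot 1_{X^* \ge \lambda}] $$

we get

$$ E[|X^*|^p] = E[\int_0^{X^*} p \lambda^{p-1} \, d\lambda] $$
$$ = E[\int_0^\infty p \lambda^{p-1} 1_{\{X^* \ge \lambda\}} \, d\lambda] $$
$$ = \int_0^\infty p \lambda^{p-1} P[X^* \ge \lambda] \, d\lambda $$
$$ \le \int_0^\infty p \lambda^{p-2} E[|X_t| \cdot 1_{X^* \ge \lambda}] \, d\lambda $$
$$ = p E[|X_t| \int_0^{X^*} \lambda^{p-2} \, d\lambda] $$
$$ = \frac{p}{p-1} E[|X_t| \cdot (X^*)^{p-1}] . $$

Hölder's inequality gives

$$ E[|X^*|^p] \le \frac{p}{p-1} E[(X^*)^p]^{(p-1)/p} E[|X_t|^p]^{1/p} $$

and the claim follows. □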
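For p = 2 the constant in the lemma is 4, and the inequality can be checked exactly on a small discrete martingale. The sketch below is an added illustration, not taken from the notes: it enumerates all $2^n$ paths of the simple random walk $S_k$ with n = 10 steps (a discrete-time martingale standing in for X) and compares $E[\max_k S_k^2]$ with $4 E[S_n^2] = 4n$. The function name `doob_check` is made up for this example.

```python
from itertools import product

def doob_check(n):
    """Return (E[max_k S_k^2], E[S_n^2]) for the simple random walk with n steps."""
    total_max, total_end = 0, 0
    for steps in product((-1, 1), repeat=n):   # all 2^n equally likely paths
        s, running_max = 0, 0
        for step in steps:
            s += step
            running_max = max(running_max, s * s)
        total_max += running_max
        total_end += s * s
    paths = 2 ** n
    return total_max / paths, total_end / paths

lhs, rhs = doob_check(10)
print(lhs, 4 * rhs)   # Doob's L^2 inequality asserts lhs <= 4 * rhs
```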


Theorem 4.19.2 (Local existence and uniqueness of solutions). Let M be a
continuous martingale. Assume f(x, t) and g(x, t) are continuous in t and
Lipschitz continuous in x. Then there exists T > 0 and a unique solution
$X_t$ of the SDE

$$ dX = f(x,t) \, dM + g(x,t) \, ds $$

with initial condition $X_0 = x_0$.


Proof. Define the operator

$$ S(X)_t = \int_0^t f(s, X_s) \, dM_s + \int_0^t g(s, X_s) \, ds $$

on $L^2$-processes. Write $S(X) = S_1(X) + S_2(X)$. We will show that on some
time interval [0, T], the map S is a contraction and that $S^n(X)$ converges
in the metric $|||X - Y|||_T = E[\sup_{s \le T} (X_s - Y_s)^2]$, if T is small enough, to
a unique fixed point. It is enough that for i = 1, 2

$$ |||S_i(X) - S_i(Y)|||_T \le (1/4) \cdot |||X - Y|||_T ; $$

then S is a contraction

$$ |||S(X) - S(Y)|||_T \le (1/2) \cdot |||X - Y|||_T . $$

By assumption, there exists a constant K such that

$$ |f(t,w) - f(t,w')| \le K \cdot \sup_{s \le t} |w_s - w'_s| . $$


(i) $|||S_1(X) - S_1(Y)|||_T = ||| \int_0^t f(s,X_s) - f(s,Y_s) \, dM_s |||_T \le (1/4) \cdot |||X - Y|||_T$
for T small enough.

Proof. By the above lemma for p = 2, we have

$$ |||S_1(X) - S_1(Y)|||_T = E[\sup_{t \le T} ( \int_0^t f(s,X) - f(s,Y) \, dM_s )^2] $$
$$ \le 4 E[( \int_0^T f(t,X) - f(t,Y) \, dM_t )^2] $$
$$ = 4 E[\int_0^T ( f(t,X) - f(t,Y) )^2 \, d\langle M,M \rangle_t] $$
$$ \le 4 K^2 E[\int_0^T \sup_{s \le t} |X_s - Y_s|^2 \, dt] $$
$$ = 4 K^2 \int_0^T |||X - Y|||_t \, dt $$
$$ \le (1/4) \cdot |||X - Y|||_T , $$

where the last inequality holds for T small enough.


(ii) $|||S_2(X) - S_2(Y)|||_T = ||| \int_0^t g(s,X_s) - g(s,Y_s) \, ds |||_T \le (1/4) \cdot |||X - Y|||_T$
for T small enough. This is proved as for differential equations in Banach
spaces.

The two estimates (i) and (ii) prove the claim in the same way as in the
classical Cauchy-Picard existence theorem.


Appendix. In this Appendix, we add the existence of solutions of ordinary
differential equations in Banach spaces. Let $\mathcal{X}$ be a Banach space and I an
interval in R. The following lemma is useful for proving existence of fixed
points of maps.


Lemma 4.19.3. Let $X = B_r(x_0) \subset \mathcal{X}$ and assume $\phi$ is a differentiable map
$X \to \mathcal{X}$. If for all $x \in X$, $\|D\phi(x)\| \le |\lambda| < 1$ and

$$ \|\phi(x_0) - x_0\| \le (1 - \lambda) \cdot r $$

then $\phi$ has exactly one fixed point in X.


Proof. The condition $\|x - x_0\| \le r$ implies that

$$ \|\phi(x) - x_0\| \le \|\phi(x) - \phi(x_0)\| + \|\phi(x_0) - x_0\| \le \lambda r + (1 - \lambda) r = r . $$

The map $\phi$ maps therefore the ball X into itself. Banach's fixed point
theorem applied to the complete metric space X and the contraction $\phi$
implies the result. □


Let f be a map from $I \times \mathcal{X}$ to $\mathcal{X}$. A differentiable map $u : J \to \mathcal{X}$ on an
open interval $J \subset I$ is called a solution of the differential equation

$$ \dot{x} = f(t,x) $$

if we have for all $t \in J$ the relation

$$ \dot{u}(t) = f(t, u(t)) . $$


Theorem 4.19.4 (Cauchy-Picard Existence theorem). Let $f : I \times \mathcal{X} \to \mathcal{X}$
be continuous in the first coordinate and locally Lipschitz continuous in the
second. Then, for every $(t_0, x_0) \in I \times \mathcal{X}$, there exists an open interval $J \subset I$
with midpoint $t_0$, such that on J, there exists exactly one solution of the
differential equation $\dot{x} = f(t,x)$ with $x(t_0) = x_0$.


Proof. There exists an interval $J(t_0, a) = (t_0 - a, t_0 + a) \subset I$ and a ball
$B(x_0, b)$, such that

$$ M = \sup\{ \|f(t,x)\| \mid (t,x) \in J(t_0,a) \times B(x_0,b) \} $$

as well as

$$ k = \sup\{ \frac{\|f(t,x_1) - f(t,x_2)\|}{\|x_1 - x_2\|} \mid (t,x_1), (t,x_2) \in J(t_0,a) \times B(x_0,b), \; x_1 \ne x_2 \} $$

are finite. Define for r < a the Banach space

$$ \mathcal{X}_r = C(J(t_0,r), \mathcal{X}) = \{ y : J(t_0,r) \to \mathcal{X}, \; y \text{ continuous} \} $$

with norm

$$ \|y\| = \sup_{t \in J(t_0,r)} \|y(t)\| . $$

Let $V_{r,b}$ be the open ball in $\mathcal{X}_r$ with radius b around the constant map
$t \mapsto x_0$. For every $y \in V_{r,b}$ we define

$$ \phi(y) : t \mapsto x_0 + \int_{t_0}^t f(s, y(s)) \, ds $$

which is again an element in $\mathcal{X}_r$. We prove now, that for r small enough,
$\phi$ is a contraction. A fixed point of $\phi$ is then a solution of the differential
equation $\dot{x} = f(t,x)$, which exists on $J = J_r(t_0)$. For two points $y_1, y_2 \in V_{r,b}$,
we have by assumption

$$ \|f(s, y_1(s)) - f(s, y_2(s))\| \le k \cdot \|y_1(s) - y_2(s)\| \le k \cdot \|y_1 - y_2\| $$

for every $s \in J_r$. Thus, we have

$$ \|\phi(y_1) - \phi(y_2)\| = \| \int_{t_0}^t f(s, y_1(s)) - f(s, y_2(s)) \, ds \| $$
$$ \le \int_{t_0}^t \|f(s, y_1(s)) - f(s, y_2(s))\| \, ds $$
$$ \le kr \cdot \|y_1 - y_2\| . $$

On the other hand, we have for every $s \in J_r$

$$ \|f(s, y(s))\| \le M $$

and so

$$ \|\phi(x_0) - x_0\| = \| \int_{t_0}^t f(s, x_0(s)) \, ds \| \le \int_{t_0}^t \|f(s, x_0(s))\| \, ds \le M \cdot r . $$

We can apply the above lemma, if kr < 1 and Mr < b(1 − kr). This is
the case, if r < b/(M + kb). By choosing r small enough, we can get the
contraction rate as small as we wish. □
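The Picard map $\phi$ from the proof can be iterated explicitly in simple cases. The sketch below is an added illustration with a made-up function name: for the scalar equation $\dot{x} = x$, $x(0) = 1$, it represents the iterates as polynomials with exact rational coefficients and applies $\phi(y)(t) = 1 + \int_0^t y(s) \, ds$ repeatedly, starting from the constant map $t \mapsto 1$. Each step adds one more term of the Taylor series of $e^t$.

```python
from fractions import Fraction

def picard_step(coeffs):
    """Apply phi(y)(t) = 1 + integral_0^t y(s) ds to a polynomial.

    coeffs[k] is the coefficient of t^k in y; integration shifts each
    coefficient one degree up and divides by the new degree.
    """
    integrated = [Fraction(0)] + [c / (k + 1) for k, c in enumerate(coeffs)]
    integrated[0] += 1          # add the initial value x_0 = 1
    return integrated

y = [Fraction(1)]               # the constant map t -> x_0 = 1
for _ in range(5):
    y = picard_step(y)
print(y)                        # coefficients 1/k! for k = 0, ..., 5
```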


Definition. A set X with a distance function d(x, y) for which the following
properties

(i) $d(y,x) = d(x,y) \ge 0$ for all $x, y \in X$,

(ii) d(x, x) = 0 and d(x, y) > 0 for $x \ne y$,

(iii) $d(x,z) \le d(x,y) + d(y,z)$ for all x, y, z

hold is called a metric space.


Example. The plane $R^2$ with the usual distance d(x, y) = |x − y|. Another
metric is the Manhattan or taxi metric $d(x,y) = |x_1 - y_1| + |x_2 - y_2|$.


Example. The set C([0, 1]) of all continuous functions x(t) on the interval
[0, 1] with the distance $d(x,y) = \max_t |x(t) - y(t)|$ is a metric space.


Definition. A map $\phi : X \to X$ is called a contraction, if there exists $\lambda < 1$
such that $d(\phi(x), \phi(y)) \le \lambda \cdot d(x,y)$ for all $x, y \in X$. The map $\phi$ shrinks the
distance of any two points by the contraction factor $\lambda$.


Example. The map $\phi(x) = \frac{1}{2} x + (1, 0)$ is a contraction on $R^2$.


Example. The map $\phi(x)(t) = \sin(t) x(t) + t$ is a contraction on C([0, 1])
because $|\phi(x)(t) - \phi(y)(t)| = |\sin(t)| \cdot |x(t) - y(t)| \le \sin(1) \cdot |x(t) - y(t)|$.


Definition. A Cauchy sequence in a metric space (X, d) is defined to be a
sequence $x_n$ which has the property that for any $\epsilon > 0$, there exists $n_0$ such
that $d(x_n, x_m) \le \epsilon$ for $n \ge n_0$, $m \ge n_0$.

A metric space in which every Cauchy sequence converges to a limit is
called complete.


Example. The n-dimensional Euclidean space

$$ ( R^n, \; d(x,y) = |x - y| = \sqrt{(x_1 - y_1)^2 + \dots + (x_n - y_n)^2} ) $$

is complete. The set of rational numbers with the usual distance

$$ ( Q, \; d(x,y) = |x - y| ) $$

is not complete.

Example. The space C[0, 1] is complete: given a Cauchy sequence $x_n$, then
$x_n(t)$ is a Cauchy sequence in R for all t. Therefore $x_n(t)$ converges pointwise
to a function x(t), and the convergence is uniform because the sequence is
Cauchy in the maximum distance. This function is continuous: take $\epsilon > 0$, then
$|x(t) - x(s)| \le |x(t) - x_n(t)| + |x_n(t) - x_n(s)| + |x_n(s) - x(s)|$ by the triangle
inequality. For large n, $|x(t) - x_n(t)| \le \epsilon/3$ and $|x_n(s) - x(s)| \le \epsilon/3$. If s is
close to t, the second term is smaller than $\epsilon/3$. So, $|x(t) - x(s)| \le \epsilon$ if
|t − s| is small.


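Theorem 4.19.5 (Banach's fixed point theorem). A contraction $\phi$ in a complete
metric space (X, d) has exactly one fixed point in X.


Proof. (i) We first show by induction that

$$ d(\phi^n(x), \phi^n(y)) \le \lambda^n \cdot d(x,y) $$

for all n.


(ii) Using the triangle inequality and $\sum_k \lambda^k = (1 - \lambda)^{-1}$, we get for all
$x \in X$,

$$ d(x, \phi^n x) \le \sum_{k=0}^{n-1} d(\phi^k x, \phi^{k+1} x) \le \sum_{k=0}^{n-1} \lambda^k d(x, \phi(x)) \le \frac{1}{1 - \lambda} \cdot d(x, \phi(x)) . $$

(iii) For all $x \in X$ the sequence $x_n = \phi^n(x)$ is a Cauchy sequence because
by (i), (ii),

$$ d(x_n, x_{n+k}) \le \lambda^n \cdot d(x, x_k) \le \lambda^n \cdot \frac{1}{1 - \lambda} \cdot d(x, x_1) . $$

By completeness of X it has a limit $\tilde{x}$ which is a fixed point of $\phi$.


(iv) There is only one fixed point. Assume, there were two fixed points $\tilde{x}, \tilde{y}$
of $\phi$. Then

$$ d(\tilde{x}, \tilde{y}) = d(\phi(\tilde{x}), \phi(\tilde{y})) \le \lambda \cdot d(\tilde{x}, \tilde{y}) . $$

This is impossible unless $\tilde{x} = \tilde{y}$. □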
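The proof is constructive: iterating the contraction from any starting point converges to the fixed point at a geometric rate. The following sketch is an added illustration using the contraction $\phi(x) = \frac{1}{2} x + (1, 0)$ on $R^2$ from the example above, whose fixed point is (2, 0); the starting point and helper names are arbitrary choices. It also checks the a priori bound $d(x_n, \tilde{x}) \le \lambda^n (1 - \lambda)^{-1} d(x_0, x_1)$, which follows from step (iii) by letting $k \to \infty$.

```python
import math

def phi(x):
    """The contraction x -> x/2 + (1, 0) on the plane, with factor 1/2."""
    return (0.5 * x[0] + 1.0, 0.5 * x[1])

def iterate(f, x, n):
    """Apply f to x a total of n times."""
    for _ in range(n):
        x = f(x)
    return x

def dist(x, y):
    """Euclidean distance in the plane."""
    return math.hypot(x[0] - y[0], x[1] - y[1])

x0 = (10.0, -7.0)                 # arbitrary starting point
fixed = (2.0, 0.0)                # solves x = x/2 + (1, 0)
x10 = iterate(phi, x0, 10)
x60 = iterate(phi, x0, 60)
bound10 = 0.5 ** 10 / (1 - 0.5) * dist(x0, phi(x0))
print(x10, x60, bound10)
```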
Chapter 5


Selected Topics 


5.1 Percolation 


Definition. Let $e_i$ be the standard basis in the lattice $Z^d$. Denote with $L^d$
the Cayley graph of $Z^d$ with the generators $A = \{ e_1, \dots, e_d \}$. This graph
$L^d = (V, E)$ has the lattice $Z^d$ as vertices. The edges or bonds in that
graph are straight line segments connecting neighboring points x, y, that is,
points satisfying $|x - y| = \sum_{i=1}^d |x_i - y_i| = 1$.


Definition. We declare each bond of $L^d$ to be open with probability $p \in [0, 1]$
and closed otherwise. Bonds are open or closed independently of all
other bonds. The product measure $P_p$ is defined on the probability space
$\Omega = \prod_{e \in E} \{0, 1\}$ of all configurations. We denote expectation with respect
to $P_p$ with $E_p[\cdot]$.


Definition. A path in $L^d$ is a sequence of vertices $(x_0, x_1, \dots, x_n)$ such that
$(x_i, x_{i+1}) = e_i$ are bonds of $L^d$. Such a path has length n and connects $x_0$
with $x_n$. A path is called open if all its edges are open and closed if all its
edges are closed. Two subgraphs of $L^d$ are disjoint if they have no edges
and no vertices in common.


Definition. Consider the random subgraph of $L^d$ containing the vertex set
$Z^d$ and only open edges. The connected components of this graph are called
open clusters. If it is finite, an open cluster is also called a lattice animal.
Call C(x) the open cluster containing the vertex x. By translation invariance,
the distribution of C(x) is independent of x and we can take x = 0
for which we write C(0) = C.

Figure. A lattice animal.


Definition. Define the percolation probability $\theta(p)$ as the probability
that a given vertex belongs to an infinite open cluster:

$$ \theta(p) = P_p[|C| = \infty] = 1 - \sum_{n=1}^{\infty} P_p[|C| = n] . $$

One of the goals of bond percolation theory is to study the function $\theta(p)$.


Lemma 5.1.1. There exists a critical value $p_c = p_c(d)$ such that $\theta(p) = 0$
for $p < p_c$ and $\theta(p) > 0$ for $p > p_c$. The value $d \mapsto p_c(d)$ is non-increasing
with respect to the dimension: $p_c(d+1) \le p_c(d)$.


Proof. The function $p \mapsto \theta(p)$ is non-decreasing and $\theta(0) = 0$, $\theta(1) = 1$. We
can therefore define

$$ p_c = \inf\{ p \in [0,1] \mid \theta(p) > 0 \} . $$

The graph $Z^d$ can be embedded into the graph $Z^{d'}$ for d < d' by realizing $Z^d$
as a linear subspace of $Z^{d'}$ parallel to a coordinate plane. Any configuration
in $L^{d'}$ projects then to a configuration in $L^d$. If the origin is in an infinite
cluster of $Z^d$, then it is also in an infinite cluster of $Z^{d'}$. □


Remark. The one-dimensional case $d = 1$ is not interesting because $p_c = 1$
there. Interesting phenomena are only possible in dimensions $d > 1$. The
planar case $d = 2$ is already very interesting.
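The percolation probability can be explored numerically. The following is a minimal sketch, not part of the original notes: it samples bond percolation on the finite box $[-n,n]^2$ and estimates the probability that the open cluster of the origin reaches the boundary of the box. This finite-volume probability is an upper bound for $\theta(p)$ and decreases to it as $n \to \infty$. The function names `reaches_boundary` and `estimate` and the choice of box are illustrative assumptions.

```python
import random
from collections import deque

def reaches_boundary(n, p, rng):
    """Sample bond percolation on the box [-n, n]^2 and report whether the
    open cluster of the origin touches the boundary of the box."""
    open_bond = {}  # bond states are sampled lazily, keyed by the sorted vertex pair

    def is_open(a, b):
        key = (a, b) if a <= b else (b, a)
        if key not in open_bond:
            open_bond[key] = rng.random() < p
        return open_bond[key]

    seen = {(0, 0)}
    queue = deque([(0, 0)])
    while queue:
        x, y = queue.popleft()
        if max(abs(x), abs(y)) == n:
            return True  # reached the boundary, no need to explore further
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nb = (x + dx, y + dy)
            if nb not in seen and is_open((x, y), nb):
                seen.add(nb)
                queue.append(nb)
    return False

def estimate(n, p, trials, seed=0):
    """Monte Carlo estimate of P_p[origin is connected to the boundary of the box]."""
    rng = random.Random(seed)
    hits = sum(reaches_boundary(n, p, rng) for _ in range(trials))
    return hits / trials
```

For instance, comparing `estimate(10, 0.3, 200)` with `estimate(10, 0.7, 200)` illustrates the change of behavior on the two sides of the planar critical value.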


Definition. A self-avoiding random walk in $L^d$ is the process $S_T$ obtained
by stopping the ordinary random walk $S_n$ with the stopping time

\[ T(\omega) = \inf \{ n \in N \mid \omega(n) = \omega(m) \text{ for some } m < n \} . \]

Let $\sigma(n)$ be the number of self-avoiding paths in $L^d$ which start at the
origin and have length $n$. The connective constant of $L^d$ is defined as

\[ \lambda(d) = \lim_{n \to \infty} \sigma(n)^{1/n} . \]

Remark. The exact value of $\lambda(d)$ is not known. But one has the elementary
estimate $d \le \lambda(d) \le 2d - 1$: a self-avoiding walk cannot reverse
direction, so that $\sigma(n) \le 2d(2d-1)^{n-1}$, and a walk going only forward
in each direction is self-avoiding, so that $\sigma(n) \ge d^n$. For example, it
is known that $\lambda(2) \in [2.62002, 2.69576]$, and numerical estimates make
one believe that the real value is 2.6381585. The number $c_n$ of self-avoiding
walks of length $n$ in $L^2$ is for small values $c_1 = 4, c_2 = 12, c_3 = 36,
c_4 = 100, c_5 = 284, c_6 = 780, c_7 = 2172, \dots$. Consult [62] for more
information on the self-avoiding random walk.
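The small values of $c_n$ quoted in the remark can be reproduced by brute-force enumeration. The sketch below is an illustration only (the function name `count_saw` is an assumption, and the exponential running time restricts it to short walks); it counts self-avoiding walks from the origin by depth-first search.

```python
def count_saw(n, d=2):
    """Number of self-avoiding walks of length n on Z^d starting at the origin."""
    steps = []
    for axis in range(d):
        for sign in (1, -1):
            step = [0] * d
            step[axis] = sign
            steps.append(tuple(step))

    def extend(pos, visited, remaining):
        if remaining == 0:
            return 1
        total = 0
        for s in steps:
            nxt = tuple(a + b for a, b in zip(pos, s))
            if nxt not in visited:          # self-avoidance constraint
                visited.add(nxt)
                total += extend(nxt, visited, remaining - 1)
                visited.remove(nxt)         # backtrack
        return total

    origin = (0,) * d
    return extend(origin, {origin}, n)

counts = [count_saw(n) for n in range(1, 8)]
print(counts)  # [4, 12, 36, 100, 284, 780, 2172]
```

The same counts also satisfy the elementary bounds $d^n \le \sigma(n) \le 2d(2d-1)^{n-1}$ used in the remark.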


Theorem 5.1.2 (Broadbent-Hammersley theorem). If $d > 1$, then

\[ 0 < \lambda(d)^{-1} \le p_c(d) \le p_c(2) < 1 . \]


Proof. (i) $p_c(d) \ge \lambda(d)^{-1}$.
Let $N(n) \le \sigma(n)$ be the number of open self-avoiding paths of length $n$
in $L^d$ starting at the origin. Since any such path is open with probability
$p^n$, we have

\[ E_p[N(n)] = p^n \sigma(n) . \]

If the origin is in an infinite open cluster, there must exist open paths of
all lengths beginning at the origin so that

\[ \theta(p) \le P_p[N(n) \ge 1] \le E_p[N(n)] = p^n \sigma(n) = (p\lambda(d) + o(1))^n , \]

which goes to zero for $p < \lambda(d)^{-1}$. This shows that
$p_c(d) \ge \lambda(d)^{-1}$.


(ii) $p_c(2) < 1$.

Denote by $L^2_*$ the dual graph of $L^2$, which has as vertices the faces of
$L^2$ and as edges pairs of faces which are adjacent. We can realize the
vertices as $Z^2 + (1/2, 1/2)$. Since there is a bijective relation between the
edges of $L^2$ and $L^2_*$, we can declare an edge of $L^2_*$ to be open if it
crosses an open edge in $L^2$ and closed if it crosses a closed edge. This
defines bond percolation on $L^2_*$.


The fact that the origin is in the interior of a closed circuit of the dual 
lattice if and only if the open cluster at the origin is finite follows from the 
Jordan curve theorem which assures that a closed path in the plane divides 
the plane into two disjoint subsets. 


Let $\rho(n)$ denote the number of closed circuits in the dual which have length
$n$ and which contain in their interiors the origin of $L^2$. Each such circuit
contains a self-avoiding walk of length $n - 1$ starting from a vertex of the
form $(k + 1/2, 1/2)$, where $0 \le k < n$. Since the number of such paths
$\gamma$ is at most $n\sigma(n-1)$, we have

\[ \rho(n) \le n \sigma(n-1) \]

and with $q = 1 - p$

\[ \sum_{\gamma} P_p[\gamma \text{ is closed}] \le \sum_{n=1}^{\infty} q^n n \sigma(n-1)
   = \sum_{n=1}^{\infty} q n (q\lambda(2) + o(1))^{n-1} , \]

which is finite if $q\lambda(2) < 1$. Furthermore, this sum goes to zero as
$q \to 0$, so that we can find $0 < \delta < 1$ such that for $p > \delta$

\[ \sum_{\gamma} P_p[\gamma \text{ is closed}] \le 1/2 . \]

We have therefore

\[ P_p[|C| = \infty] = P_p[\text{no } \gamma \text{ is closed}]
   \ge 1 - \sum_{\gamma} P_p[\gamma \text{ is closed}] \ge 1/2 \]

so that $p_c(2) \le \delta < 1$. □


Remark. We will see below that even $p_c(2) \le 1 - \lambda(2)^{-1}$. It is
however known that $p_c(2) = 1/2$.


Definition. The parameter set $p < p_c$ is called the sub-critical phase, the
set $p > p_c$ is the supercritical phase.


Definition. For $p < p_c$, one is also interested in the mean size of the open
cluster

\[ \chi(p) = E_p[|C|] . \]

For $p > p_c$, one would like to know the mean size of the finite clusters

\[ \chi^f(p) = E_p[\, |C| \mid |C| < \infty \,] . \]

It is known that $\chi(p) < \infty$ for $p < p_c$ but only conjectured that
$\chi^f(p) < \infty$ for $p > p_c$.

An interesting question is whether there exists an infinite open cluster at the
critical point $p = p_c$. The answer is known to be no in the case $d = 2$ and
generally believed to be no for $d \ge 3$. For $p$ near $p_c$ it is believed
that the percolation probability $\theta(p)$ and the mean size $\chi(p)$ behave
as powers of $|p - p_c|$. It is conjectured that the following critical
exponents

\[ \gamma = - \lim_{p \nearrow p_c} \frac{\log \chi(p)}{\log |p - p_c|} , \]
\[ \beta = \lim_{p \searrow p_c} \frac{\log \theta(p)}{\log |p - p_c|} , \]
\[ \delta^{-1} = - \lim_{n \to \infty} \frac{\log P_{p_c}[|C| \ge n]}{\log n} \]

exist.

Percolation deals with a family of probability spaces $(\Omega, A, P_p)$, where
$\Omega = \{0,1\}^{L^d}$ is the set of configurations with product
$\sigma$-algebra $A$ and product measure $P_p = (p, 1-p)^{L^d}$.




Definition. There exists a natural partial ordering in $\Omega$ coming from the
ordering on $\{0,1\}$: we say $\omega \le \omega'$ if $\omega(e) \le \omega'(e)$
for all bonds $e \in L^d$. We call a random variable $X$ on $(\Omega, A, P)$
increasing if $\omega \le \omega'$ implies $X(\omega) \le X(\omega')$. It is
called decreasing if $-X$ is increasing. As usual, this notion can also be
defined for measurable sets $A \in A$: a set $A$ is increasing if $1_A$ is
increasing.


Lemma 5.1.3. If $X$ is an increasing random variable in
$L^1(\Omega, P_q) \cap L^1(\Omega, P_p)$, then

\[ E_p[X] \le E_q[X] \]

if $p < q$.


Proof. If $X$ depends only on a single bond $e$, we can write
$E_p[X] = pX(1) + (1-p)X(0)$. Because $X$ is assumed to be increasing, we have
$\frac{d}{dp} E_p[X] = X(1) - X(0) \ge 0$, which gives $E_p[X] \le E_q[X]$ for
$p < q$. If $X$ depends only on finitely many bonds, we can write it as a sum
$X = \sum_{i=1}^n X_i$ of variables $X_i$ which depend only on one bond and get
again

\[ \frac{d}{dp} E_p[X] = \sum_{i=1}^n (X_i(1) - X_i(0)) \ge 0 . \]

In general we approximate every random variable in
$L^1(\Omega, P_p) \cap L^1(\Omega, P_q)$ by step functions $X_i$ which depend
only on finitely many coordinates. Since $E_p[X_i] \to E_p[X]$ and
$E_q[X_i] \to E_q[X]$, the claim follows. □


The following correlation inequality is named after Fortuin, Kasteleyn and
Ginibre (1971).


Theorem 5.1.4 (FKG inequality). For increasing random variables
$X, Y \in L^2(\Omega, P_p)$, we have

\[ E_p[XY] \ge E_p[X] \cdot E_p[Y] . \]


Proof. As in the proof of the above lemma, we prove the claim first for random
variables $X, Y$ which depend only on $n$ edges $e_1, e_2, \dots, e_n$ and
proceed by induction.

(i) The claim, if $X$ and $Y$ only depend on one edge $e$.
We have

\[ (X(\omega) - X(\omega'))(Y(\omega) - Y(\omega')) \ge 0 , \]

since the left hand side is 0 if $\omega(e) = \omega'(e)$; if
$1 = \omega(e) > \omega'(e) = 0$, both factors are nonnegative since $X, Y$ are
increasing; and if $0 = \omega(e) < \omega'(e) = 1$, both factors are
non-positive since $X, Y$ are increasing. Therefore

\[ 0 \le \sum_{\sigma, \sigma' \in \{0,1\}} (X(\sigma) - X(\sigma'))(Y(\sigma) - Y(\sigma'))
      P_p[\omega(e) = \sigma] P_p[\omega(e) = \sigma']
   = 2 (E_p[XY] - E_p[X] E_p[Y]) . \]

(ii) Assume the claim is known for all functions which depend on $k$ edges
with $k < n$. We claim that it holds also for $X, Y$ depending on $n$ edges
$e_1, e_2, \dots, e_n$.

Let $A_k = A(e_1, \dots, e_k)$ be the $\sigma$-algebra generated by functions
depending only on the edges $e_1, \dots, e_k$. The random variables

\[ X_k = E_p[X | A_k] , \quad Y_k = E_p[Y | A_k] \]

depend only on $e_1, \dots, e_k$ and are increasing. By induction,

\[ E_p[X_{n-1} Y_{n-1}] \ge E_p[X_{n-1}] E_p[Y_{n-1}] . \]

By the tower property of conditional expectation, the right hand side is
$E_p[X] E_p[Y]$. For fixed $e_1, \dots, e_{n-1}$, the variables $X, Y$ depend
only on the single edge $e_n$, so that (i) gives
$(XY)_{n-1} \ge X_{n-1} Y_{n-1}$ and so

\[ E_p[XY] = E_p[(XY)_{n-1}] \ge E_p[X_{n-1} Y_{n-1}] . \]


(iii) Let $X, Y$ be arbitrary and define $X_n = E_p[X | A_n]$,
$Y_n = E_p[Y | A_n]$. We know from (ii) that
$E_p[X_n Y_n] \ge E_p[X_n] E_p[Y_n]$. Since $X_n = E[X | A_n]$ and
$Y_n = E[Y | A_n]$ are martingales which are bounded in $L^2(\Omega, P_p)$,
Doob's convergence theorem (3.5.4) implies that $X_n \to X$ and $Y_n \to Y$ in
$L^2$ and therefore $E[X_n] \to E[X]$ and $E[Y_n] \to E[Y]$. By the Schwarz
inequality, we also get convergence of $X_n Y_n$ to $XY$ in the $L^1$ norm of
$(\Omega, A, P_p)$:

\[ ||X_n Y_n - XY||_1 \le ||(X_n - X) Y_n||_1 + ||X (Y_n - Y)||_1 \]
\[ \le ||X_n - X||_2 \, ||Y_n||_2 + ||X||_2 \, ||Y_n - Y||_2 \]
\[ \le C (||X_n - X||_2 + ||Y_n - Y||_2) \to 0 , \]

where $C = \max(||X||_2, ||Y||_2)$ is a constant. This means
$E_p[X_n Y_n] \to E_p[XY]$. □


Remark. It follows immediately that if $A, B$ are increasing events in
$\Omega$, then $P_p[A \cap B] \ge P_p[A] \cdot P_p[B]$.
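For random variables depending on a handful of bonds, the FKG inequality can be checked by summing over all configurations. The sketch below is only an illustration under assumed choices: the two increasing functions `X` (some pair of bonds is fully open) and `Y` (number of open bonds) on four bonds are examples chosen here, not taken from the text.

```python
from itertools import product

def expectation(f, n, p):
    """E_p[f] for a function f on {0,1}^n under the product measure P_p."""
    total = 0.0
    for w in product((0, 1), repeat=n):
        k = sum(w)
        total += f(w) * p**k * (1 - p) ** (n - k)
    return total

def X(w):
    # indicator that the bond pair {0, 1} or the bond pair {2, 3} is open (increasing)
    return max(w[0] * w[1], w[2] * w[3])

def Y(w):
    # number of open bonds (increasing)
    return sum(w)

def fkg_gap(p, n=4):
    """E_p[XY] - E_p[X] E_p[Y], which the FKG inequality says is nonnegative."""
    exy = expectation(lambda w: X(w) * Y(w), n, p)
    return exy - expectation(X, n, p) * expectation(Y, n, p)

gaps = [fkg_gap(p) for p in (0.1, 0.3, 0.5, 0.7, 0.9)]
```

Replacing `Y` by a decreasing function such as `lambda w: -sum(w)` reverses the sign of the gap, which is a quick way to see that monotonicity of both variables is needed.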


Example. Let $\Gamma_i$ be families of paths in $L^d$ and let $A_i$ be the event
that some path in $\Gamma_i$ is open. Then $A_i$ are increasing events and so,
after applying the inequality $k$ times, we get

\[ P_p\Big[ \bigcap_{i=1}^k A_i \Big] \ge \prod_{i=1}^k P_p[A_i] . \]

We show now how this inequality can be used to give an explicit bound for
the critical percolation probability $p_c$ in $L^2$. The following corollary
still belongs to the theorem of Broadbent-Hammersley.


Corollary 5.1.5.

\[ p_c(2) \le 1 - \lambda(2)^{-1} . \]


Proof. Given any integer $N \in N$, define the events

\[ F_N = \{ \text{there is no closed path of length} \le N \text{ in } L^2_* \} , \]
\[ G_N = \{ \text{there is no closed path of length} > N \text{ in } L^2_* \} , \]

where $L^2_*$ is the dual lattice. We know that
$F_N \cap G_N \subset \{ |C| = \infty \}$. Since $F_N$ and $G_N$ are both
increasing, the correlation inequality says
$P_p[F_N \cap G_N] \ge P_p[F_N] \cdot P_p[G_N]$. We deduce

\[ \theta(p) = P_p[|C| = \infty] \ge P_p[F_N \cap G_N] \ge P_p[F_N] \cdot P_p[G_N] . \]

If $(1-p)\lambda(2) < 1$, then we know that

\[ P_p[G_N^c] \le \sum_{n=N}^{\infty} (1-p)^n n \sigma(n-1) , \]

which goes to zero for $N \to \infty$. For $N$ large enough, we have therefore
$P_p[G_N] \ge 1/2$. Since also $P_p[F_N] > 0$, it follows that $\theta(p) > 0$
if $(1-p)\lambda(2) < 1$, that is if $p > 1 - \lambda(2)^{-1}$, which proves the
claim. □


Definition. Given $A \in A$ and $\omega \in \Omega$. We say that an edge
$e \in L^d$ is pivotal for the pair $(A, \omega)$ if
$1_A(\omega) \ne 1_A(\omega_e)$, where $\omega_e$ is the unique configuration
which agrees with $\omega$ except at the edge $e$.


Theorem 5.1.6 (Russo's formula). Let $A$ be an increasing event depending
only on finitely many edges of $L^d$. Then

\[ \frac{d}{dp} P_p[A] = E_p[N(A)] , \]

where $N(A)$ is the number of edges which are pivotal for $A$.


Proof. (i) We define a new probability space.
The family of probability spaces $(\Omega, A, P_p)$ can be embedded in one
probability space

\[ ([0,1]^{L^d}, B([0,1]^{L^d}), P) , \]

where $P$ is the product measure $dx^{L^d}$. Given a configuration
$\eta \in [0,1]^{L^d}$ and $p \in [0,1]$, we get a configuration in $\Omega$ by
defining $\eta_p(e) = 1$ if $\eta(e) < p$ and $\eta_p(e) = 0$ else. More
generally, given $p \in [0,1]^{L^d}$, we get configurations $\eta_p(e) = 1$ if
$\eta(e) < p(e)$ and $\eta_p(e) = 0$ else. Like this, we can define
configurations with a large class of probability measures
$P_p = \prod_{e \in L^d} (p(e), 1 - p(e))$ with one probability space and we
have

\[ P_p[A] = P[\eta_p \in A] . \]


(ii) Derivative with respect to one $p(f)$.
Assume $p$ and $p'$ differ only at an edge $f$ such that $p(f) \le p'(f)$. Then
$\{ \eta_p \in A \} \subset \{ \eta_{p'} \in A \}$ so that

\[ P_{p'}[A] - P_p[A] = P[\eta_{p'} \in A] - P[\eta_p \in A] \]
\[ = P[\eta_{p'} \in A; \eta_p \notin A] \]
\[ = (p'(f) - p(f)) P_p[f \text{ pivotal for } A] . \]

Divide both sides by $(p'(f) - p(f))$ and let $p'(f) \to p(f)$. This gives

\[ \frac{\partial}{\partial p(f)} P_p[A] = P_p[f \text{ pivotal for } A] . \]

(iii) The claim, if $A$ depends on finitely many edges.
If $A$ depends on finitely many edges, then $P_p[A]$ is a function of a finite
set $\{ p(f_i) \}_{i=1}^m$ of edge probabilities. The chain rule gives then

\[ \frac{d}{dp} P_p[A] = \sum_{i=1}^m \frac{\partial}{\partial p(f_i)} P_p[A] \Big|_{p = (p, p, \dots, p)} \]
\[ = \sum_{i=1}^m P_p[f_i \text{ pivotal for } A] \]
\[ = E_p[N(A)] . \]


(iv) The general claim.
In general, define for every finite set of edges $F \subset E$

\[ p_F(e) = p + 1_{e \in F} \, \delta , \]

where $0 \le p \le p + \delta \le 1$. Since $A$ is increasing, we have

\[ P_{p+\delta}[A] \ge P_{p_F}[A] \]

and therefore

\[ \frac{1}{\delta} (P_{p+\delta}[A] - P_p[A]) \ge \frac{1}{\delta} (P_{p_F}[A] - P_p[A])
   \to \sum_{e \in F} P_p[e \text{ pivotal for } A] \]

as $\delta \to 0$. The claim is obtained by making $F$ larger and larger, filling
out $E$. □


Example. Let $F = \{ e_1, e_2, \dots, e_m \} \subset E$ be a finite set of edges
and

\[ A = \{ \text{the number of open edges in } F \text{ is} \ge k \} . \]

An edge $e \in F$ is pivotal for $A$ if and only if $F \setminus \{e\}$ has
exactly $k - 1$ open edges. We have

\[ P_p[e \text{ is pivotal}] = \binom{m-1}{k-1} p^{k-1} (1-p)^{m-k} \]

so that by Russo's formula

\[ \frac{d}{dp} P_p[A] = \sum_{e \in F} P_p[e \text{ is pivotal}]
   = m \binom{m-1}{k-1} p^{k-1} (1-p)^{m-k} . \]

Since we know $P_0[A] = 0$, we obtain by integration

\[ P_p[A] = \sum_{l=k}^{m} \binom{m}{l} p^l (1-p)^{m-l} . \]
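Russo's formula can be verified numerically for the event of this example. The following sketch (function names are illustrative assumptions) computes $P_p[A]$ and the expected number of pivotal edges by enumerating all $2^m$ configurations and compares both with the closed form $m \binom{m-1}{k-1} p^{k-1}(1-p)^{m-k}$.

```python
from itertools import product
from math import comb

def prob_A(m, k, p):
    """P_p[at least k of the m edges are open], by enumeration."""
    return sum(p ** sum(w) * (1 - p) ** (m - sum(w))
               for w in product((0, 1), repeat=m) if sum(w) >= k)

def expected_pivotal(m, k, p):
    """E_p[N(A)]: expected number of edges whose flip changes the indicator of A."""
    total = 0.0
    for w in product((0, 1), repeat=m):
        weight = p ** sum(w) * (1 - p) ** (m - sum(w))
        pivotal = 0
        for e in range(m):
            flipped = w[:e] + (1 - w[e],) + w[e + 1:]
            if (sum(w) >= k) != (sum(flipped) >= k):
                pivotal += 1
        total += weight * pivotal
    return total

def closed_form(m, k, p):
    """Right hand side of Russo's formula for this event."""
    return m * comb(m - 1, k - 1) * p ** (k - 1) * (1 - p) ** (m - k)

m, k, p, h = 5, 3, 0.4, 1e-4
numerical_derivative = (prob_A(m, k, p + h) - prob_A(m, k, p - h)) / (2 * h)
```

With these illustrative parameters, `numerical_derivative`, `expected_pivotal(m, k, p)` and `closed_form(m, k, p)` agree up to the discretization error of the central difference.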


Remark. If $A$ no longer depends on only finitely many edges, then $P_p[A]$
need no longer be differentiable for all values of $p$.


Definition. The mean size of the open cluster is $\chi(p) = E_p[|C|]$.


Theorem 5.1.7 (Uniqueness). For $p < p_c$, the mean size of the open cluster
is finite: $\chi(p) < \infty$.


The proof of this theorem is quite involved and we will not give the full
argument. Let $S(n, x) = \{ y \in Z^d \mid |x - y| = \sum_{i=1}^d |x_i - y_i| \le n \}$
be the ball of radius $n$ around $x$ in $Z^d$ and let $A_n$ be the event that
there exists an open path joining the origin with some vertex in
$\partial S(n, 0)$.


Lemma 5.1.8. (Exponential decay of radius of the open cluster) If $p < p_c$,
there exists $\psi_p > 0$ such that $P_p[A_n] < e^{-n \psi_p}$.


Proof (of Theorem 5.1.7, using the lemma). Clearly,
$|S(n, 0)| \le C_d \cdot (n+1)^d$ with some constant $C_d$. Let
$M = \max \{ n \mid A_n \text{ occurs} \}$. By definition of $p_c$, if
$p < p_c$, then $P_p[M < \infty] = 1$. We get

\[ E_p[|C|] \le \sum_n E_p[\, |C| \mid M = n \,] \cdot P_p[M = n] \]
\[ \le \sum_n |S(n, 0)| \, P_p[A_n] \]
\[ \le \sum_n C_d (n+1)^d e^{-n \psi_p} < \infty . \]

□


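Proof (of the lemma). We are concerned with the probabilities
$g_p(n) = P_p[A_n]$. Since $A_n$ are increasing events, Russo's formula gives

\[ g_p'(n) = E_p[N(A_n)] , \]

where $N(A_n)$ is the number of pivotal edges in $A_n$. We have

\[ g_p'(n) = \sum_e P_p[e \text{ pivotal for } A_n] \]
\[ = \sum_e \frac{1}{p} P_p[e \text{ open and pivotal for } A_n] \]
\[ = \sum_e \frac{1}{p} P_p[A_n \cap \{ e \text{ pivotal for } A_n \}] \]
\[ = \sum_e \frac{1}{p} P_p[e \text{ pivotal for } A_n \mid A_n] \cdot P_p[A_n] \]
\[ = \frac{1}{p} E_p[N(A_n) \mid A_n] \cdot P_p[A_n] \]
\[ = \frac{1}{p} E_p[N(A_n) \mid A_n] \cdot g_p(n) \]

so that

\[ \frac{g_p'(n)}{g_p(n)} = \frac{1}{p} E_p[N(A_n) \mid A_n] . \]

By integrating up from $\alpha$ to $\beta$, we get

\[ g_\alpha(n) = g_\beta(n) \exp\Big( - \int_\alpha^\beta \frac{1}{p} E_p[N(A_n) \mid A_n] \, dp \Big) \]
\[ \le g_\beta(n) \exp\Big( - \int_\alpha^\beta E_p[N(A_n) \mid A_n] \, dp \Big) \]
\[ \le \exp\Big( - \int_\alpha^\beta E_p[N(A_n) \mid A_n] \, dp \Big) . \]

One needs to show then that $E_p[N(A_n) \mid A_n]$ grows roughly linearly in $n$
when $p < p_c$. This is quite technical and we skip it. □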


Definition. The number of open clusters per vertex is defined as

\[ \kappa(p) = E_p[|C|^{-1}] = \sum_{n=1}^{\infty} \frac{1}{n} P_p[|C| = n] . \]

Let $B_n$ be the box with side length $2n$ and center at the origin and let
$K_n$ be the number of open clusters in $B_n$. The following proposition
explains the name of $\kappa$.


Proposition 5.1.9. In $L^1(\Omega, A, P_p)$ we have

\[ K_n / |B_n| \to \kappa(p) . \]


Proof. Let $C_n(x)$ be the connected component of the open cluster in $B_n$
which contains $x \in Z^d$. Define $\Gamma_n(x) = |C_n(x)|^{-1}$.

(i) $\sum_{x \in B_n} \Gamma_n(x) = K_n$.
Proof. If $\Sigma$ is an open cluster of $B_n$, then each vertex $x \in \Sigma$
contributes $|\Sigma|^{-1}$ to the left hand side. Thus, each open cluster
contributes 1 to the left hand side.

(ii) $\frac{K_n}{|B_n|} \ge \frac{1}{|B_n|} \sum_{x \in B_n} \Gamma(x)$, where
$\Gamma(x) = |C(x)|^{-1}$.
Proof. Follows from (i) and the trivial fact $\Gamma(x) \le \Gamma_n(x)$.

(iii) $\frac{1}{|B_n|} \sum_{x \in B_n} \Gamma(x) \to E_p[\Gamma(0)] = \kappa(p)$.
Proof. $\Gamma(x)$ are bounded random variables which have a distribution which
is invariant under the ergodic group of translations in $Z^d$. The claim follows
from the ergodic theorem.

(iv) $\liminf_{n \to \infty} \frac{K_n}{|B_n|} \ge \kappa(p)$ almost everywhere.
Proof. Follows from (ii) and (iii).

(v) $\sum_{x \in B_n} \Gamma_n(x) \le \sum_{x \in B_n} \Gamma(x) + \sum_{x \sim \partial B_n} \Gamma_n(x)$,
where $x \sim Y$ means that $x$ is in the same cluster as one of the elements
$y \in Y \subset Z^d$.

(vi) $\frac{1}{|B_n|} \sum_{x \in B_n} \Gamma_n(x) \le \frac{1}{|B_n|} \sum_{x \in B_n} \Gamma(x) + \frac{|\partial B_n|}{|B_n|}$. □
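The ratio $K_n / |B_n|$ from the proposition is easy to sample. The sketch below is an illustration under assumptions made here (a `side` x `side` box in $Z^2$, a union-find structure, and the function name `clusters_per_vertex`); it counts the open clusters of one sampled configuration restricted to the box, where an isolated vertex counts as a cluster of size 1.

```python
import random

def clusters_per_vertex(side, p, seed=0):
    """K / |B| for one sample of bond percolation restricted to a side x side box of Z^2."""
    rng = random.Random(seed)
    parent = list(range(side * side))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    clusters = side * side  # start with every vertex as its own cluster
    for x in range(side):
        for y in range(side):
            v = x * side + y
            # visit each bond of the box once: to the right and upward neighbor
            for nx, ny in ((x + 1, y), (x, y + 1)):
                if nx < side and ny < side and rng.random() < p:
                    ra, rb = find(v), find(nx * side + ny)
                    if ra != rb:
                        parent[ra] = rb   # an open bond merges two clusters
                        clusters -= 1
    return clusters / (side * side)
```

At $p = 0$ every vertex is its own cluster, so the ratio is 1, in agreement with $\kappa(0) = E_0[|C|^{-1}] = 1$; at $p = 1$ the whole box is a single cluster.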


Remark. It is known that the function $\kappa(p)$ is continuously
differentiable on $[0,1]$. It is even known that $\kappa$ and the mean size of
the open cluster $\chi(p)$ are real analytic functions on the interval
$[0, p_c)$. There would be much more to say in percolation theory. We mention:

The uniqueness of the infinite open cluster:
For $p > p_c$, and if $\theta(p_c) > 0$ also for $p = p_c$, there exists a
unique infinite open cluster.

Regularity of some functions:
For $p > p_c$, the functions $\theta(p), \chi^f(p), \kappa(p)$ are
differentiable. In general, $\theta(p)$ is continuous from the right.

The critical probability in two dimensions is 1/2.


5.2 Random Jacobi matrices


Definition. A Jacobi matrix with IID potential $V_\omega(n)$ is a bounded
self-adjoint operator on the Hilbert space

\[ l^2(Z) = \Big\{ (\dots, x_{-1}, x_0, x_1, x_2, \dots) \;\Big|\; \sum_{k=-\infty}^{\infty} |x_k|^2 < \infty \Big\} \]

of the form

\[ L_\omega u(n) = \sum_{|m-n|=1} u(m) + V_\omega(n) u(n) = (\Delta + V_\omega)(u)(n) , \]

where $V_\omega(n)$ are IID random variables in $L^\infty$. These operators are
called discrete random Schrödinger operators. We are interested in properties
of $L$ which hold for almost all $\omega \in \Omega$. In this section, we mostly
write the elements $\omega$ of the probability space $(\Omega, A, P)$ as a lower
index.


Definition. A bounded linear operator $L$ has pure point spectrum if there
exists a countable set of eigenvalues $\lambda_i$ with eigenfunctions $\phi_i$
such that $L\phi_i = \lambda_i \phi_i$ and the $\phi_i$ span the Hilbert space
$l^2(Z)$. A random operator has pure point spectrum if $L_\omega$ has pure point
spectrum for almost all $\omega \in \Omega$.


Our goal is to prove the following theorem:


Theorem 5.2.1 (Fröhlich-Spencer). Let $V(n)$ be IID random variables with
uniform distribution on $[0,1]$. There exists $\lambda_0$ such that for
$\lambda > \lambda_0$, the operator $L_\omega = \Delta + \lambda \cdot V_\omega$
has pure point spectrum for almost all $\omega$.


We will give a recent elegant proof of Aizenman-Molchanov following [94].


Definition. Given $E \in C \setminus R$, define the Green function

\[ G_\omega(m, n, E) = [(L_\omega - E)^{-1}]_{mn} . \]

Let $\mu = \mu_\omega$ be the spectral measure of the vector $e_0$. This measure
is defined as the functional $C(R) \to R$, $f \mapsto f(L_\omega)_{00}$ by
$f(L_\omega)_{00} = E[f(L)_{00}]$. Define the function

\[ F(z) = \int_R \frac{d\mu(y)}{y - z} . \]

It is a function on the complex plane and called the Borel transform of the
measure $\mu$. An important role will be played by its derivative

\[ F'(z) = \int_R \frac{d\mu(y)}{(y - z)^2} . \]


Definition. Given any Jacobi matrix $L$, let $L_\alpha$ be the operator
$L + \alpha P_0$, where $P_0$ is the projection onto the one-dimensional space
spanned by $\delta_0$. One calls $L_\alpha$ a rank-one perturbation of $L$.


Theorem 5.2.2 (Integral formula of Javrjan-Kotani). The average over all
spectral measures $d\mu_\alpha$ is the Lebesgue measure:

\[ \int_R d\mu_\alpha \, d\alpha = dE . \]
R 


Proof. The second resolvent formula gives

\[ (L_\alpha - z)^{-1} - (L - z)^{-1} = -\alpha (L_\alpha - z)^{-1} P_0 (L - z)^{-1} . \]

Looking at the 00 entry of this matrix identity, we obtain

\[ F_\alpha(z) - F(z) = -\alpha F_\alpha(z) F(z) , \]

which gives, when solved for $F_\alpha$, the Aronszajn-Krein formula

\[ F_\alpha(z) = \frac{F(z)}{1 + \alpha F(z)} . \]

We have to show that for any continuous function $f : C \to C$

\[ \int_R \int_R f(x) \, d\mu_\alpha(x) \, d\alpha = \int_R f(x) \, dE(x) \]

and it is enough to verify this for the dense set of functions

\[ \{ f_z(x) = (x - z)^{-1} - (x + i)^{-1} \mid z \in C \setminus R \} . \]

Contour integration in the upper half plane gives $\int_R f_z(x) \, dx = 0$ for
$Im(z) < 0$ and $2\pi i$ for $Im(z) > 0$. On the other hand

\[ \int f_z(x) \, d\mu_\alpha(x) = F_\alpha(z) - F_\alpha(-i) , \]

which is by the Aronszajn-Krein formula equal to

\[ h_z(\alpha) := \frac{1}{\alpha + F(z)^{-1}} - \frac{1}{\alpha + F(-i)^{-1}} . \]

Now, if $\pm Im(z) > 0$, then $\pm Im F(z) > 0$ so that $\pm Im F(z)^{-1} < 0$.
This means that $h_z(\alpha)$ has either two poles in the lower half plane if
$Im(z) < 0$ or one in each half plane if $Im(z) > 0$. Contour integration in the
upper half plane (now with $\alpha$) implies that
$\int_R h_z(\alpha) \, d\alpha = 0$ for $Im(z) < 0$ and $2\pi i$ for
$Im(z) > 0$. □


In theorem (2.12.2), we have seen that any Borel measure $\mu$ on the real line
has a unique Lebesgue decomposition
$d\mu = d\mu_{ac} + d\mu_{sing} = d\mu_{ac} + d\mu_{sc} + d\mu_{pp}$. The
function $F$ is related to this decomposition in the following way:


Proposition 5.2.3. (Facts about Borel transform) For $\epsilon \to 0$, the
measures $\pi^{-1} Im F(E + i\epsilon) \, dE$ converge weakly to $\mu$.

\[ d\mu_{sing}(\{ E \mid Im F(E + i0) = \infty \}) = 1 , \]
\[ d\mu(\{ E_0 \}) = \lim_{\epsilon \to 0} Im F(E_0 + i\epsilon) \, \epsilon , \]
\[ d\mu_{ac}(E) = \pi^{-1} Im F(E + i0) \, dE . \]
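The Aronszajn-Krein formula $F_\alpha = F/(1 + \alpha F)$ from the proof of Theorem 5.2.2 is an algebraic identity, so it can be tested on a finite truncation of the Jacobi matrix. The sketch below is an illustration only: the truncation to 41 sites, the parameter values, and the function name `green_00` are assumptions made here. It computes the 00 entry of the resolvent with the Thomas algorithm for tridiagonal systems.

```python
import random

def green_00(V, center, z, alpha=0.0):
    """[(L + alpha*P_0 - z)^{-1}]_{00} for the finite Jacobi matrix with off-diagonal
    entries 1 and diagonal V, where the site `center` plays the role of 0."""
    N = len(V)
    diag = [complex(v) - z for v in V]
    diag[center] += alpha               # rank-one perturbation at the site 0
    rhs = [0j] * N
    rhs[center] = 1 + 0j
    # Thomas algorithm: forward elimination ...
    for i in range(1, N):
        w = 1.0 / diag[i - 1]
        diag[i] -= w                    # all off-diagonal entries are 1
        rhs[i] -= w * rhs[i - 1]
    # ... followed by back substitution.
    x = [0j] * N
    x[-1] = rhs[-1] / diag[-1]
    for i in range(N - 2, -1, -1):
        x[i] = (rhs[i] - x[i + 1]) / diag[i]
    return x[center]

rng = random.Random(1)
V = [rng.random() for _ in range(41)]   # IID potential, uniform on [0, 1]
z, alpha = 0.3 + 0.5j, 0.7
F = green_00(V, 20, z)
F_alpha = green_00(V, 20, z, alpha)
```

For $Im(z) > 0$ one also observes $Im F(z) > 0$, the property used in the contour integration argument.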


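Definition. Define for $\alpha \ne 0$ the sets

\[ S_\alpha = \{ x \in R \mid F(x + i0) = -\alpha^{-1}, \; F'(x) = \infty \} , \]
\[ P_\alpha = \{ x \in R \mid F(x + i0) = -\alpha^{-1}, \; F'(x) < \infty \} , \]
\[ L = \{ x \in R \mid Im F(x + i0) \ne 0 \} . \]


Lemma 5.2.4. (Aronszajn-Donoghue) The set $P_\alpha$ is the set of eigenvalues
of $L_\alpha$. One has $(d\mu_\alpha)_{sc}(S_\alpha) = 1$ and
$(d\mu_\alpha)_{ac}(L) = 1$. The sets $P_\alpha, S_\alpha, L$ are mutually
disjoint.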


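Proof. If $F(E + i0) = -1/\alpha$, then

\[ \lim_{\epsilon \to 0} \epsilon \, Im F_\alpha(E + i\epsilon) = (\alpha^2 F'(E))^{-1} , \]

since $F(E + i\epsilon) = -1/\alpha + i\epsilon F'(E) + o(\epsilon)$ if
$F'(E) < \infty$, and since $\epsilon^{-1} Im(1 + \alpha F) \to \infty$ if
$F'(E) = \infty$, which means $\epsilon |1 + \alpha F|^{-1} \to 0$; because
$F \to -1/\alpha$, one gets $\epsilon |F/(1 + \alpha F)| \to 0$ in that case.
The theorem of de la Vallée Poussin (see [88]) states that the set

\[ \{ E \mid |F_\alpha(E + i0)| = \infty \} \]

has full $(d\mu_\alpha)_{sing}$ measure. Because $F_\alpha = F/(1 + \alpha F)$,
we know that $|F_\alpha(E + i0)| = \infty$ is equivalent to
$F(E + i0) = -1/\alpha$. □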


The following criterion of Simon-Wolff [96] will be important. In the case of
IID potentials with absolutely continuous distribution, a spectral averaging
argument will then lead to pure point spectrum also for $\alpha = 0$.


Theorem 5.2.5 (Simon-Wolff criterion). For any interval $[a, b] \subset R$, the
random operator $L$ has pure point spectrum if

\[ F'(E) < \infty \]

for almost all $E \in [a, b]$.


Proof. By hypothesis, the Lebesgue measure of $S = \{ E \mid F'(E) = \infty \}$
is zero. This means by the integral formula that $d\mu_\alpha(S) = 0$ for almost
all $\alpha$. The Aronszajn-Donoghue lemma (5.2.4) implies

\[ \mu_\alpha(S_\alpha \cap [a, b]) = \mu_\alpha(L \cap [a, b]) = 0 \]

so that $L_\alpha$ has only point spectrum. □


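Lemma 5.2.6. (Formula of Simon-Wolff) For each $E \in R$, the sum
$\sum_{n \in Z} |[(L - E - i\epsilon)^{-1}]_{0n}|^2$ increases monotonically as
$\epsilon \searrow 0$ and converges pointwise to $F'(E)$.


Proof. For $\epsilon > 0$, we have

\[ \sum_{n \in Z} |[(L - E - i\epsilon)^{-1}]_{0n}|^2 = ||(L - E - i\epsilon)^{-1} \delta_0||^2 \]
\[ = |[(L - E - i\epsilon)^{-1} (L - E + i\epsilon)^{-1}]_{00}| \]
\[ = \int_R \frac{d\mu(x)}{(x - E)^2 + \epsilon^2} , \]

from which the monotonicity and the limit follow. □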


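Lemma 5.2.7. There exists a constant $C$ such that for all complex numbers
$\alpha, \beta$

\[ \int_0^1 |x - \alpha|^{1/2} |x - \beta|^{-1/2} \, dx \ge C \int_0^1 |x - \beta|^{-1/2} \, dx . \]


Proof. We can assume without loss of generality that $\alpha \in [0,1]$, because
replacing a general complex $\alpha$ with the nearest point in $[0,1]$ only
decreases the left hand side. Because the symmetry $\alpha \mapsto 1 - \alpha$
leaves the claim invariant, we can also assume that $\alpha \in [0, 1/2]$. But
then

\[ \int_0^1 |x - \alpha|^{1/2} |x - \beta|^{-1/2} \, dx \ge \Big( \frac{1}{4} \Big)^{1/2} \int_{3/4}^1 |x - \beta|^{-1/2} \, dx . \]

The function

\[ h(\beta) = \frac{\int_{3/4}^1 |x - \beta|^{-1/2} \, dx}{\int_0^1 |x - \beta|^{-1/2} \, dx} \]

is non-zero, continuous and satisfies $h(\infty) = 1/4$. Therefore
$\inf_\beta h(\beta) > 0$ and the claim holds with

\[ C = \frac{1}{2} \inf_\beta h(\beta) > 0 . \]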


The next lemma is an estimate for the free Laplacian. 


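Lemma 5.2.8. Let $f, g \in l^\infty(Z)$ be nonnegative and let
$0 < a < (2d)^{-1}$. Then

\[ (1 - a\Delta) f \le g \;\Rightarrow\; f \le (1 - a\Delta)^{-1} g , \]
\[ [(1 - a\Delta)^{-1}]_{ij} \le (2da)^{|j - i|} (1 - 2da)^{-1} . \]


Proof. Since $||\Delta|| \le 2d$, we can write
$(1 - a\Delta)^{-1} = \sum_{m=0}^{\infty} (a\Delta)^m$, which is preserving
positivity. Since $[(a\Delta)^m]_{ij} = 0$ for $m < |i - j|$, we have

\[ [(1 - a\Delta)^{-1}]_{ij} = \sum_{m=|i-j|}^{\infty} [(a\Delta)^m]_{ij} \le \sum_{m=|i-j|}^{\infty} (2da)^m . \]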
m=|i—j| m=|i—3| 


We come now to the proof of theorem (5.2.1): 


Proof. In order to prove theorem (5.2.1), we have by Simon-Wolff only to
show that $F'(E) < \infty$ for almost all $E$. This will be achieved by proving
$E[F'(E)^{1/4}] < \infty$. By the formula of Simon-Wolff, we have therefore to
show that

\[ \sup_{z \in C \setminus R} E\Big[ \Big( \sum_n |G(n, 0, z)|^2 \Big)^{1/4} \Big] < \infty . \]

Since

\[ \Big( \sum_n |G(n, 0, z)|^2 \Big)^{1/4} \le \sum_n |G(n, 0, z)|^{1/2} , \]

we have only to control the latter term. Define $g_z(n) = G(n, 0, z)$ and
$k_z(n) = E[|g_z(n)|^{1/2}]$. The aim is now to give an estimate for

\[ \sum_{n \in Z} k_z(n) \]

which holds uniformly for $Im(z) \ne 0$.
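(i)
\[ E[|\lambda V(n) - z|^{1/2} |g_z(n)|^{1/2}] \le \delta_{n,0} + \sum_{|j|=1} k_z(n + j) . \]
Proof. $(L - z) g_z(n) = \delta_{n,0}$ means

\[ (\lambda V(n) - z) g_z(n) = \delta_{n,0} - \sum_{|j|=1} g_z(n + j) . \]

Jensen's inequality gives

\[ E[|\lambda V(n) - z|^{1/2} |g_z(n)|^{1/2}] \le \delta_{n,0} + \sum_{|j|=1} k_z(n + j) . \]

(ii)
\[ E[|\lambda V(n) - z|^{1/2} |g_z(n)|^{1/2}] \ge C \lambda^{1/2} k_z(n) . \]
Proof. We can write $g_z(n) = A/(\lambda V(n) + B)$, where $A, B$ are functions
of $\{ V(l) \}_{l \ne n}$. The independent random variables $V(k)$ can be
realized over the probability space
$\Omega = [0,1]^Z = \prod_{k \in Z} \Omega(k)$. We average now
$|\lambda V(n) - z|^{1/2} |g_z(n)|^{1/2}$ over $\Omega(n)$ and use the
elementary integral estimate of lemma (5.2.7):

\[ \int_{\Omega(n)} \frac{|\lambda v - z|^{1/2} |A|^{1/2}}{|\lambda v + B|^{1/2}} \, dv
   = |A|^{1/2} \int_0^1 |v - z\lambda^{-1}|^{1/2} |v + B\lambda^{-1}|^{-1/2} \, dv \]
\[ \ge C |A|^{1/2} \int_0^1 |v + B\lambda^{-1}|^{-1/2} \, dv \]
\[ = C \lambda^{1/2} \int_0^1 |A/(\lambda v + B)|^{1/2} \, dv . \]

Taking the expectation over the remaining coordinates gives
$C \lambda^{1/2} E[|g_z(n)|^{1/2}] = C \lambda^{1/2} k_z(n)$ on the right hand
side.

(iii)
\[ k_z(n) \le (C \lambda^{1/2})^{-1} \Big( \sum_{|j|=1} k_z(n + j) + \delta_{n,0} \Big) . \]
Proof. Follows directly from (i) and (ii).

(iv)
\[ (C \lambda^{1/2} - \Delta) k \le \delta_{n,0} . \]
Proof. Rewriting (iii).

(v) Define $\alpha = C \lambda^{1/2}$. Then

\[ k_z(n) \le \alpha^{-1} (2d/\alpha)^{|n|} (1 - 2d/\alpha)^{-1} . \]

Proof. For $Im(z) \ne 0$, we have $k_z \in l^\infty(Z)$. From lemma (5.2.8) and
(iv), written as $(1 - \Delta/\alpha) k \le \alpha^{-1} \delta_{n,0}$, we have

\[ k(n) \le \alpha^{-1} [(1 - \Delta/\alpha)^{-1}]_{0n} \le \alpha^{-1} \Big( \frac{2d}{\alpha} \Big)^{|n|} \Big( 1 - \frac{2d}{\alpha} \Big)^{-1} . \]

(vi) For $\lambda > 4C^{-2}$, we get a uniform bound for $\sum_n k_z(n)$.
Proof. Since $(C \lambda^{1/2})^{-1} < 1/2$, we have $2d/\alpha < 1$ for the
lattice $Z$, where $d = 1$, and the bounds in (v) form a convergent geometric
series.

(vii) Pure point spectrum.
Proof. By Simon-Wolff, we have pure point spectrum for $L_\alpha$ for almost all
$\alpha$. Because the set of random operators of $L_\alpha$ and $L_0$ coincide
on a set of measure $\ge 1 - 2\alpha$, we get also pure point spectrum of
$L_\omega$ for almost all $\omega$. □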


5.3 Estimation theory


Estimation theory is a branch of mathematical statistics. The aim is to 
estimate continuous or discrete parameters for models in an optimal way. 
This leads to extremization problems. We start with some terminology. 


Definition. A collection $(\Omega, A, P_\theta)$ of probability spaces is called
a statistical model. If $X$ is a random variable, its expectation with respect
to the measure $P_\theta$ is denoted by $E_\theta[X]$, its variance is
$Var_\theta[X] = E_\theta[(X - E_\theta[X])^2]$. If $X$ is continuous, then its
probability density function is denoted by $f_\theta$. In that case one has of
course $E_\theta[X] = \int_R x f_\theta(x) \, dx$. The parameters $\theta$ are
taken from a parameter space $\Theta$, which is assumed to be a subset of $R$ or
$R^k$.


Definition. A probability distribution $\mu = p(\theta) \, d\theta$ on
$(\Theta, B)$ is called an a priori distribution on $\Theta \subset R$. It
allows one to define the global expectation
$E[X] = \int_\Theta E_\theta[X] \, d\mu(\theta)$.


Definition. Given $n$ independent and identically distributed random variables
$X_1, \dots, X_n$ on the probability space $(\Omega, A, P_\theta)$, we want to
estimate a quantity $g(\theta)$ using an estimator
$T(\omega) = t(X_1(\omega), \dots, X_n(\omega))$.


Example. If the quantity $g(\theta) = E_\theta[X_i]$ is the expectation of the
random variables, we can look at the estimator
$T(\omega) = \frac{1}{n} \sum_{i=1}^n X_i(\omega)$, the arithmetic mean. The
arithmetic mean is natural because for any data $x_1, \dots, x_n$, the function
$f(x) = \sum_{i=1}^n (x_i - x)^2$ is minimized by the arithmetic mean of the
data.


Example. We can also take the estimator $T(\omega)$ which is the median of
$X_1(\omega), \dots, X_n(\omega)$. The median is a natural quantity because the
function $f(x) = \sum_{i=1}^n |x_i - x|$ is minimized by the median. Proof. For
$a \le b$ one has $|a - x| + |b - x| = |b - a| + C_{a,b}(x)$, where
$C_{a,b}(x)$ is zero if $a \le x \le b$, equal to $2(x - b)$ if $x > b$ and
equal to $2(a - x)$ if $x < a$. Order the data as $x_1 \le \dots \le x_n$ and
pair $x_j$ with $x_{n+1-j}$. If $n = 2m + 1$ is odd, we have
$f(x) = \sum_{j=1}^m |x_j - x_{n+1-j}| + |x_{m+1} - x| + \sum_{j=1}^m C_{x_j, x_{n+1-j}}(x)$,
which is minimized for $x = x_{m+1}$. If $n = 2m$, we have
$f(x) = \sum_{j=1}^m |x_j - x_{n+1-j}| + \sum_{j=1}^m C_{x_j, x_{n+1-j}}(x)$,
which is minimized for $x \in [x_m, x_{m+1}]$.
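Both minimization properties can be checked on a small data set by a grid search. The sketch below is purely illustrative: the sample `data`, the grid resolution and the helper names are assumptions made here.

```python
import statistics

def sum_sq(data, x):
    """f(x) = sum of squared deviations from x."""
    return sum((xi - x) ** 2 for xi in data)

def sum_abs(data, x):
    """f(x) = sum of absolute deviations from x."""
    return sum(abs(xi - x) for xi in data)

data = [2.0, 3.5, 1.0, 8.0, 4.5]           # illustrative sample with n odd
grid = [i / 100 for i in range(0, 1001)]   # candidates 0.00, 0.01, ..., 10.00

best_sq = min(grid, key=lambda x: sum_sq(data, x))    # minimizer of the squares
best_abs = min(grid, key=lambda x: sum_abs(data, x))  # minimizer of the absolute values

print(best_sq, statistics.mean(data))     # 3.8 3.8
print(best_abs, statistics.median(data))  # 3.5 3.5
```

The single large value 8.0 pulls the mean above the median, which illustrates why the two estimators can differ on the same data.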

Definition. Define the bias of an estimator $T$ as

\[ B(\theta) = B_\theta[T] = E_\theta[T] - g(\theta) . \]

The bias is also called the systematic error. If the bias is zero, the estimator
is called unbiased. With an a priori distribution on $\Theta$, one can define
the global error $B(T) = \int_\Theta B(\theta) \, d\mu(\theta)$.


Proposition 5.3.1. A linear estimator
$T(\omega) = \sum_{i=1}^n \alpha_i X_i(\omega)$ with $\sum_i \alpha_i = 1$ is
unbiased for the quantity $g(\theta) = E_\theta[X_i]$.


Proof. $E_\theta[T] = \sum_{i=1}^n \alpha_i E_\theta[X_i] = E_\theta[X_i]$. □


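Proposition 5.3.2. For $g(\theta) = Var_\theta[X_i]$ and fixed mean $m$, the
estimator $T = \frac{1}{n} \sum_{i=1}^n (X_i - m)^2$ is unbiased. If the mean is
unknown, the estimator $T = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2$ with
$\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i$ is unbiased.


Proof. a) $E_\theta[T] = \frac{1}{n} \sum_{i=1}^n E_\theta[(X_i - m)^2] = Var_\theta[X_i] = g(\theta)$.

b) For $T = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2$, we get

\[ E_\theta[T] = E_\theta[X_i^2] - E_\theta\Big[ \frac{1}{n^2} \sum_{i,j} X_i X_j \Big] \]
\[ = E_\theta[X_i^2] - \frac{1}{n} E_\theta[X_i^2] - \frac{n(n-1)}{n^2} E_\theta[X_i]^2 \]
\[ = \Big( 1 - \frac{1}{n} \Big) E_\theta[X_i^2] - \frac{n-1}{n} E_\theta[X_i]^2 \]
\[ = \frac{n-1}{n} Var_\theta[X_i] . \]

Therefore $\frac{n}{n-1} T$ is the correct unbiased estimate. □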
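The factor $(n-1)/n$ in part b) can be confirmed exactly for a small discrete distribution by enumerating all samples. In the sketch below, the three-point distribution, the sample size $n = 4$ and the function name `expected_estimators` are assumptions chosen for illustration; exact rational arithmetic avoids rounding issues.

```python
from fractions import Fraction
from itertools import product

def expected_estimators(values, probs, n):
    """Exact expectations of (1/n) sum (x_i - xbar)^2 and of its n/(n-1) rescaling
    for n IID draws from a finite distribution."""
    biased = Fraction(0)
    for idx in product(range(len(values)), repeat=n):
        weight = Fraction(1)
        for i in idx:
            weight *= probs[i]            # probability of this sample
        sample = [values[i] for i in idx]
        xbar = sum(sample, Fraction(0)) / n
        biased += weight * sum((x - xbar) ** 2 for x in sample) / n
    return biased, biased * Fraction(n, n - 1)

values = [Fraction(0), Fraction(1), Fraction(3)]
probs = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
mean = sum(v * q for v, q in zip(values, probs))
variance = sum(q * (v - mean) ** 2 for v, q in zip(values, probs))

biased, unbiased = expected_estimators(values, probs, n=4)
```

Here `biased` equals $\frac{3}{4}$ of the true variance and `unbiased` equals the variance itself, in agreement with the proposition.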


Remark. Part b) is the reason why statisticians often take the average of
$\frac{n}{n-1} (x_i - \bar{x})^2$ as an estimate for the variance of $n$ data
points $x_i$ with mean $\bar{x}$ if the actual mean value $m$ is not known.


Definition. The expectation of the quadratic estimation error

\[ Err_\theta[T] = E_\theta[(T - g(\theta))^2] \]

is called the risk function or the mean square error of the estimator $T$. It
measures the estimator performance. We have

\[ Err_\theta[T] = Var_\theta[T] + B_\theta[T]^2 , \]

where $B_\theta[T]$ is the bias.


Example. If $T$ is unbiased, then $Err_\theta[T] = Var_\theta[T]$.


Example. The arithmetic mean is the "best linear unbiased estimator".
Proof. With $T = \sum_i \alpha_i X_i$, where $\sum_i \alpha_i = 1$, the risk
function is

\[ Err_\theta[T] = Var_\theta[T] = \sum_i \alpha_i^2 Var_\theta[X_i] . \]

It is by Lagrange minimal for $\alpha_i = 1/n$.


Definition. For continuous random variables, the maximum likelihood function
$t(x_1, \dots, x_n)$ is defined as the point $\theta$ at which
$\theta \mapsto L_\theta(x_1, \dots, x_n) := f_\theta(x_1) \cdots f_\theta(x_n)$
is maximal. The maximum likelihood estimator is the random variable

\[ T(\omega) = t(X_1(\omega), \dots, X_n(\omega)) . \]

For discrete random variables, $L_\theta(x_1, \dots, x_n)$ would be replaced by
$P_\theta[X_1 = x_1, \dots, X_n = x_n]$.

One also looks at the maximum a posteriori estimator, which is the maximum of

\[ \theta \mapsto L_\theta(x_1, \dots, x_n) \, p(\theta) = f_\theta(x_1) \cdots f_\theta(x_n) \, p(\theta) , \]

where $p(\theta) \, d\theta$ was the a priori distribution on $\Theta$.


Definition. The minimax principle is the aim to find

\[ \min_T \max_\theta R(\theta, T) . \]

The Bayes principle is the aim to find

\[ \min_T \int_\Theta R(\theta, T) \, d\mu(\theta) . \]

Example. Assume $f_\theta(x) = \frac{1}{2} e^{-|x - \theta|}$. The maximum
likelihood function

\[ L_\theta(x_1, \dots, x_n) = \frac{1}{2^n} e^{-\sum_j |x_j - \theta|} \]

is maximal when $\sum_j |x_j - \theta|$ is minimal, which means that
$t(x_1, \dots, x_n)$ is the median of the data $x_1, \dots, x_n$.


Example. Assume $f_\theta(x) = \theta^x e^{-\theta} / x!$ is the probability
density of the Poisson distribution. The maximum likelihood function

\[ L_\theta(x_1, \dots, x_n) = \frac{e^{\sum_i \log(\theta) x_i - n\theta}}{x_1! \cdots x_n!} \]

is maximal for $\theta = \sum_{i=1}^n x_i / n$.
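The Poisson example can be checked by maximizing the log-likelihood numerically. The following is a small sketch under assumptions made here (the count data, the grid of candidate parameters, and the function name `poisson_log_likelihood` are illustrative); it confirms that the maximizer is the sample mean.

```python
from math import lgamma, log

def poisson_log_likelihood(theta, data):
    """log L_theta(x_1, ..., x_n) = sum_i (x_i log(theta) - theta - log(x_i!))."""
    return sum(x * log(theta) - theta - lgamma(x + 1) for x in data)

data = [3, 0, 2, 5, 1, 4, 2, 3]              # illustrative counts, sample mean 2.5
grid = [k / 1000 for k in range(1, 10001)]   # candidate theta in (0, 10]

theta_hat = max(grid, key=lambda th: poisson_log_likelihood(th, data))
print(theta_hat, sum(data) / len(data))      # 2.5 2.5
```

Working with the logarithm does not change the maximizer and avoids the very small products that occur in $L_\theta$ for larger samples.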


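Example. The maximum likelihood estimator for $\theta = (m, \sigma^2)$ for
Gaussian distributed random variables with density
$f_\theta(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-(x - m)^2 / (2\sigma^2)}$ has
the maximum likelihood function maximized for

\[ t(x_1, \dots, x_n) = \Big( \frac{1}{n} \sum_i x_i , \; \frac{1}{n} \sum_i (x_i - \bar{x})^2 \Big) . \]


Definition. Define the Fisher information of a random variable $X$ with
density $f_\theta$ as

\[ I(\theta) = \int \Big( \frac{f_\theta'(x)}{f_\theta(x)} \Big)^2 f_\theta(x) \, dx . \]

If $\theta$ is a vector, one defines the Fisher information matrix

\[ I_{ij}(\theta) = \int \frac{\partial_{\theta_i} f_\theta}{f_\theta} \, \frac{\partial_{\theta_j} f_\theta}{f_\theta} \, f_\theta \, dx . \]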


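Lemma 5.3.3. $I(\theta) = Var_\theta[f_\theta' / f_\theta]$.


Proof. $E_\theta[f_\theta' / f_\theta] = \int f_\theta' \, dx = 0$ so that

\[ Var_\theta\Big[ \frac{f_\theta'}{f_\theta} \Big] = E_\theta\Big[ \Big( \frac{f_\theta'}{f_\theta} \Big)^2 \Big] . \]


Lemma 5.3.4. $I(\theta) = -E_\theta[(\log f_\theta)'']$.


Proof. Integration by parts gives:

\[ E[(\log f_\theta)''] = \int (\log f_\theta)'' f_\theta \, dx = -\int (\log f_\theta)' f_\theta' \, dx
   = -\int \Big( \frac{f_\theta'}{f_\theta} \Big)^2 f_\theta \, dx . \]

□


Definition. The score function for a continuous random variable is defined
as the logarithmic derivative $\rho_\theta = f_\theta' / f_\theta$. One has
$I(\theta) = E_\theta[\rho_\theta^2] = Var_\theta[\rho_\theta]$.


Example. If $X$ is a Gaussian random variable with mean $m$ and variance
$\sigma^2$, the score function $\rho = f'/f = -(x - m)/\sigma^2$ is linear and
has variance $1/\sigma^2$. The Fisher information $I$ is $1/\sigma^2$. We see
that $Var[X] = 1/I$. This is a special case $n = 1, T = X, \theta = m$ of the
following bound: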


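Theorem 5.3.5 (Rao-Cramer inequality).

\[ Var_\theta[T] \ge \frac{(1 + B'(\theta))^2}{n I(\theta)} . \]

In the unbiased case, one has

\[ Err_\theta[T] \ge \frac{1}{n I(\theta)} . \]


Proof. 1) $\theta + B(\theta) = E_\theta[T] = \int t(x_1, \dots, x_n) L_\theta(x_1, \dots, x_n) \, dx_1 \cdots dx_n$.

2) Differentiating 1) with respect to $\theta$ gives

\[ 1 + B'(\theta) = \int t(x_1, \dots, x_n) L_\theta'(x_1, \dots, x_n) \, dx_1 \cdots dx_n \]
\[ = \int t(x_1, \dots, x_n) \frac{L_\theta'(x_1, \dots, x_n)}{L_\theta(x_1, \dots, x_n)} L_\theta(x_1, \dots, x_n) \, dx_1 \cdots dx_n \]
\[ = E_\theta\Big[ T \frac{L_\theta'}{L_\theta} \Big] . \]

3) $1 = \int L_\theta(x_1, \dots, x_n) \, dx_1 \cdots dx_n$ implies

\[ 0 = \int L_\theta'(x_1, \dots, x_n) \, dx_1 \cdots dx_n = E_\theta[L_\theta' / L_\theta] . \]

4) Using 3) and 2)

\[ Cov[T, L_\theta' / L_\theta] = E_\theta[T L_\theta' / L_\theta] - 0 = 1 + B'(\theta) . \]

5)
\[ (1 + B'(\theta))^2 = Cov^2\Big[ T, \frac{L_\theta'}{L_\theta} \Big] \]
\[ \le Var_\theta[T] \, Var_\theta\Big[ \frac{L_\theta'}{L_\theta} \Big] \]
\[ = Var_\theta[T] \sum_{i=1}^n E_\theta\Big[ \Big( \frac{f_\theta'(x_i)}{f_\theta(x_i)} \Big)^2 \Big] \]
\[ = Var_\theta[T] \, n I(\theta) , \]

where we used 4), the lemma and

\[ L_\theta' / L_\theta = \sum_{i=1}^n f_\theta'(x_i) / f_\theta(x_i) . \]

□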


Definition. Closely related to the Fisher information is the already defined
Shannon entropy of a random variable $X$:

\[ S(\theta) = -\int f_\theta \log(f_\theta) \, dx , \]

as well as the power entropy

\[ N(\theta) = \frac{1}{2\pi e} e^{2 S(\theta)} . \]


Theorem 5.3.6 (Information Inequalities). If $X, Y$ are independent random
variables then the following inequalities hold:

a) Fisher information inequality: $I_{X+Y}^{-1} \ge I_X^{-1} + I_Y^{-1}$.

b) Power entropy inequality: $N_{X+Y} \ge N_X + N_Y$.

c) Uncertainty property: $I_X N_X \ge 1$.

In all cases, equality holds if and only if the random variables are Gaussian.


Proof. a) $I_{X+Y} \le c^2 I_X + (1 - c)^2 I_Y$ is proven using the Jensen
inequality (2.5.1). Take then $c = I_Y / (I_X + I_Y)$.
b) and c) are exercises. □


Theorem 5.3.7 (Rao-Cramer bound). A random variable X with mean m
and variance $\sigma^2$ satisfies $I_X \geq 1/\sigma^2$. Equality holds if and only if X has a
normal distribution.


Proof. This is a special case of the Rao-Cramer inequality, where $\theta$ is fixed and
n = 1. The bias is automatically zero. A direct computation, which also gives
uniqueness: with $\rho = f'/f$, the identity
$E[(aX + b)\rho(X)] = \int (ax + b) f'(x)\, dx = -a \int f(x)\, dx = -a$
implies
$$0 \leq E[(\rho(X) + (X - m)/\sigma^2)^2] = E[\rho(X)^2] + 2 E[(X - m)\rho(X)]/\sigma^2 + E[(X - m)^2/\sigma^4] = I_X - 2/\sigma^2 + 1/\sigma^2 .$$
Equality holds if and only if $\rho_X$ is linear, that is if X is normal. □

We see that the normal distribution has the smallest Fisher information
among all distributions with the same variance $\sigma^2$.
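
As a numerical illustration of this bound, the following sketch compares the Fisher information $I_X = E[\rho(X)^2]$, $\rho = f'/f$, of a normal and of a Laplace distribution with the same variance. The choice of the Laplace distribution, the midpoint quadrature and all function names are assumptions made for illustration only; they do not appear in the notes.

```python
import math

def fisher_information(score, density, lo, hi, steps=100_000):
    """Midpoint-rule approximation of I_X = E[score(X)^2]."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        total += score(x) ** 2 * density(x) * h
    return total

sigma = 1.5

# Normal distribution with variance sigma^2: score f'/f = -x / sigma^2.
def normal_density(x):
    return math.exp(-x * x / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

def normal_score(x):
    return -x / sigma**2

# Laplace distribution with the same variance 2 b^2 = sigma^2: score = -sign(x) / b.
b = sigma / math.sqrt(2)

def laplace_density(x):
    return math.exp(-abs(x) / b) / (2 * b)

def laplace_score(x):
    return -math.copysign(1.0, x) / b

i_normal = fisher_information(normal_score, normal_density, -40.0, 40.0)
i_laplace = fisher_information(laplace_score, laplace_density, -40.0, 40.0)
print(i_normal, 1 / sigma**2)   # both close to 1 / sigma^2
print(i_laplace)                # larger than 1 / sigma^2
```

In this experiment the normal distribution attains the bound $1/\sigma^2$, while the Laplace distribution of equal variance has about twice that Fisher information.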


5.4 Vlasov dynamics 


Vlasov dynamics generalizes Hamiltonian n-body particle dynamics. It deals
with the evolution of the law $P^t$ of a discrete random vector $X^t$. If $P^t$ is
a discrete measure located on finitely many points, then it is the usual
dynamics of n bodies which attract or repel each other. In general, the
stochastic process $X^t$ describes the evolution of densities or the evolution
of surfaces. It is an important feature of Vlasov theory that while the random
variables $X^t$ stay smooth, their laws $P^t$ can develop singularities. This
can be useful to model shocks. Due to the overlap of this section with geometry
and dynamics, the notation slightly changes in this section. We write
$X^t$ for the stochastic process for example and not $X_t$ as before.


Definition. Let $\Omega = M$ be a 2p-dimensional Euclidean space or torus with
a probability measure m and let N be a Euclidean space of dimension 2q.
Given a potential $V : \mathbb{R}^q \to \mathbb{R}$, the Vlasov flow $X^t = (f^t, g^t) : M \to N$ is
defined by the differential equation
$$\dot{f} = g, \qquad \dot{g} = -\int_M \nabla V(f(\omega) - f(\eta))\, dm(\eta) .$$
These equations are called the Hamiltonian equations of the Vlasov flow.
We can interpret $X^t$ as a vector-valued stochastic process on the probability
space $(M, \mathcal{A}, m)$. The probability space $(M, \mathcal{A}, m)$ labels the particles which
move on the target space N.


Example. If p = 0 and M is a finite set $\Omega = \{\omega_1, \dots, \omega_n\}$, then $X^t$ describes
the evolution of n particles $(f_i, g_i) = X(\omega_i)$. Vlasov dynamics is therefore
a generalization of n-body dynamics. For example, if
$$V(x) = \frac{|x|^2}{2} ,$$
then $\nabla V(x) = x$ and the Vlasov Hamiltonian system
$$\dot{f} = g, \qquad \dot{g} = -\int_M (f(\omega) - f(\eta))\, dm(\eta)$$
is, for a measure m giving each point the weight 1/n, equivalent to the n-body evolution
$$\dot{f}_i = g_i, \qquad \dot{g}_i = -\frac{1}{n} \sum_{j=1}^n (f_i - f_j) .$$
In a center of mass coordinate system where $\sum_{i=1}^n f_i(x) = 0$, this simplifies
to a system of decoupled harmonic oscillators
$$\frac{d^2}{dt^2} f_i(x) = -f_i(x) .$$


Example. If $N = M = \mathbb{R}^2$ and m is a measure, then the process $X^t$
describes a volume-preserving deformation of the plane M. In other words,
$X^t$ is a one-parameter family of volume-preserving diffeomorphisms in the
plane.


Figure. An example with $M = N = \mathbb{R}^2$, where the measure m
is located on 2 points. The Vlasov evolution describes a deformation
of the plane. The situation is shown at time t = 0. The coordinates
(x, y) describe the position and the speed of the particles.


Figure. The situation at time t = 0.1. The two particles have evolved
in the phase space N. Each point moves as a "test particle" in the
force field of the 2 particles. Even though the 2 body problem is
integrable, its periodic motion acts like a "mixer" for the complicated
evolution of the test particles.


Example. Let $M = N = \mathbb{R}^2$ and assume that the measure m has its support
on a smooth closed curve C. The process $X^t$ is again a volume-preserving
deformation of the plane. It describes the evolution of a continuum of particles
on the curve. Dynamically, it can for example describe the evolution
of a curve where each part of the curve interacts with each other part. The
picture sequence below shows the evolution of a particle gas with support
on a closed curve in phase space. The interaction potential is $V(x) = e^{-x}$.
Because the curve at time t is the image of the diffeomorphism $X^t$, it
will never have self intersections. The curvature of the curve is expected
to grow exponentially at many points. The deformation transformation
$X^t = (f^t, g^t)$ satisfies the differential equation
$$\frac{d}{dt} f^t = g^t, \qquad \frac{d}{dt} g^t = \int_M e^{-(f^t(\omega) - f^t(\eta))}\, dm(\eta) .$$
If r(s), $s \in [0,1]$ is the parameterization of the curve C so that $m(r[a,b]) = (b - a)$,
then the equations are
$$\frac{d}{dt} f^t(x) = g^t(x), \qquad \frac{d}{dt} g^t(x) = \int_0^1 e^{-(f^t(x) - f^t(r(s)))}\, ds .$$
The evolved curve $C^t$ at time t is parameterized by $s \mapsto (f^t(r(s)), g^t(r(s)))$.


Figure. The support of the measure $P^0$ on $N = \mathbb{R}^2$.

Figure. The support of the measure $P^{0.4}$ on $N = \mathbb{R}^2$.

Figure. The support of the measure $P^{1.2}$ on $N = \mathbb{R}^2$.


Example. If $X^t$ is a stochastic process on $(\Omega = M, \mathcal{A}, m)$ which takes values
in N, then $P^t$ is a probability measure on N defined by $P^t[A] = m((X^t)^{-1} A)$.
It is called the push-forward measure or law of the random vector $X^t$. The
measure $P^t$ is a measure in the phase space N. The Vlasov evolution defines
a family of probability spaces $(N, \mathcal{B}, P^t)$. The spatial particle density $\rho$ is
the law of the random variable x(x, y) = x.


Example. Assume the measure $P^0$ is located on a curve $r(s) = (s, \sin(s))$
and assume that there is no particle interaction at all: V = 0. Then $P^t$ is
supported on a curve $(s + t \sin(s), \sin(s))$. While the spatial particle density
has initially a smooth density $\sqrt{1 + \cos(s)^2}$, it becomes discontinuous after
some time.

Figure. Already for the free evolution of particles on a curve in
phase space, the spatial particle density can become non-smooth
after some time.
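
The loss of smoothness in the free evolution can be seen from the map $s \mapsto s + t \sin(s)$, which sends a particle label to its position at time t. The sketch below checks on a grid whether this map is still strictly increasing. The grid size and the function names are assumptions for illustration. Since the derivative is $1 + t \cos(s)$, the map stops being injective once t exceeds 1, and the curve then folds over in the spatial direction.

```python
import math

def position(s, t):
    """Spatial coordinate at time t of the particle starting at (s, sin(s)) with V = 0."""
    return s + t * math.sin(s)

def is_monotone(t, samples=10_000):
    """Check on a grid over one period whether s -> position(s, t) is strictly increasing."""
    xs = [position(2 * math.pi * i / samples, t) for i in range(samples + 1)]
    return all(b > a for a, b in zip(xs, xs[1:]))

monotone_before = is_monotone(0.5)   # derivative 1 + 0.5 cos(s) stays positive
monotone_after = is_monotone(1.5)    # derivative 1 + 1.5 cos(s) changes sign
print(monotone_before, monotone_after)
```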


Example. In the case of the quadratic potential $V(x) = x^2/2$ assume m has
a density $\rho(x, y) = e^{-x^2 - 2y^2}$. Then $P^t$ has the density
$\rho^t(x, y) = \rho(x \cos(t) - y \sin(t), x \sin(t) + y \cos(t))$, which is the initial density
evaluated at the point reached by following the rotation backwards for time t.
To get the spatial density of particles from this density in the phase space,
we have to integrate y out and do a conditional expectation.


Lemma 5.4.1. (Maxwell) If $X^t = (f^t, g^t)$ is a solution of the Vlasov Hamiltonian
flow, then the law $P^t = (X^t)^* m$ satisfies the Vlasov equation
$$\dot{P}^t(x, y) + y \cdot \nabla_x P^t(x, y) - W(x) \cdot \nabla_y P^t(x, y) = 0$$
with $W(x) = \int_N \nabla_x V(x - x') P^t(x', y')\, dy' dx'$.


Proof. We have $\int \nabla V(f(\omega) - f(\eta))\, dm(\eta) = W(f(\omega))$. Given a smooth
function h on N of compact support, we calculate
$$L = \int_N h(x, y) \frac{d}{dt} P^t(x, y)\, dx dy$$
as follows:
$$L = \frac{d}{dt} \int_N h(x, y) P^t(x, y)\, dx dy$$
$$= \frac{d}{dt} \int_M h(f(\omega, t), g(\omega, t))\, dm(\omega)$$
$$= \int_M \nabla_x h(f(\omega, t), g(\omega, t))\, g(\omega, t)\, dm(\omega) - \int_M \nabla_y h(f(\omega, t), g(\omega, t)) \int_M \nabla V(f(\omega) - f(\eta))\, dm(\eta)\, dm(\omega)$$
$$= \int_N \nabla_x h(x, y)\, y\, P^t(x, y)\, dx dy - \int_N P^t(x, y) \nabla_y h(x, y) \int_N \nabla V(x - x') P^t(x', y')\, dx' dy'\, dx dy$$
$$= -\int_N h(x, y) \nabla_x P^t(x, y)\, y\, dx dy + \int_N h(x, y) W(x) \cdot \nabla_y P^t(x, y)\, dx dy .$$
□


Remark. The Vlasov equation is an example of an integro-differential equation.
The right hand side is an integral. In a short hand notation, the Vlasov
equation is
$$\dot{P} + y \cdot P_x - W(x) \cdot P_y = 0 ,$$
where $W = \nabla_x V \star P$ is the convolution of the force $\nabla_x V$ with P.

Example. V(x) = 0. Particles move freely. The Vlasov equation becomes
the transport equation $\dot{P}^t(x, y) + y \cdot \nabla_x P^t(x, y) = 0$ which is in one
dimension a partial differential equation $u_t + y u_x = 0$. It has solutions
$u(t, x, y) = u(0, x - ty, y)$. Restricting this function to y = x gives the Burgers
equation $u_t + x u_x = 0$.


Example. For a quadratic potential $V(x) = x^2/2$, the Hamilton equations are
$$\dot{f}(\omega) = g(\omega), \qquad \dot{g}(\omega) = -\Big(f(\omega) - \int_M f(\eta)\, dm(\eta)\Big) .$$
In center-of-mass coordinates $\tilde{f} = f - E[f]$, the system is a decoupled
system of a continuum of oscillators $\dot{f} = g$, $\dot{g} = -f$ with solutions
$$f(t) = f(0) \cos(t) + g(0) \sin(t), \qquad g(t) = -f(0) \sin(t) + g(0) \cos(t) .$$
The evolution for the density P is the partial differential equation
$$\frac{d}{dt} P^t(x, y) + y \cdot \nabla_x P^t(x, y) - x \cdot \nabla_y P^t(x, y) = 0 ,$$
written in short hand as $u_t + y \cdot u_x - x \cdot u_y = 0$, which has the explicit solution
$P^t(x, y) = P^0(\cos(t) x - \sin(t) y, \sin(t) x + \cos(t) y)$. It is an example of a
Hamilton-Jacobi equation.
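
A quick numerical plausibility check of the explicit solution: the sketch below takes the rotated density $P^0(\cos(t) x - \sin(t) y, \sin(t) x + \cos(t) y)$, obtained by following the oscillator flow backwards, and evaluates $u_t + y u_x - x u_y$ with central differences. The Gaussian initial density, the step size and the sample points are assumptions chosen for illustration.

```python
import math

def p0(x, y):
    """Initial phase-space density (unnormalized Gaussian, chosen for illustration)."""
    return math.exp(-x * x - 2 * y * y)

def p(t, x, y):
    """Candidate solution: the initial density transported along the rotation."""
    c, s = math.cos(t), math.sin(t)
    return p0(c * x - s * y, s * x + c * y)

def residual(t, x, y, h=1e-5):
    """Central-difference value of u_t + y u_x - x u_y."""
    u_t = (p(t + h, x, y) - p(t - h, x, y)) / (2 * h)
    u_x = (p(t, x + h, y) - p(t, x - h, y)) / (2 * h)
    u_y = (p(t, x, y + h) - p(t, x, y - h)) / (2 * h)
    return u_t + y * u_x - x * u_y

points = [(0.3, 0.5, -0.2), (1.1, -0.7, 0.4), (2.0, 0.1, 0.9)]
worst = max(abs(residual(t, x, y)) for t, x, y in points)
print(worst)  # small, limited only by the finite-difference error
```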
Example. On any Riemannian manifold with Laplace-Beltrami operator
$\Delta$, there are natural potentials: the Poisson equation $\Delta \phi = \rho$ is solved by
$\phi = V \star \rho$, where $\star$ is the convolution. This defines Newton potentials on
the manifold. Here are some examples:


For $N = \mathbb{R}$, the Laplacian $\Delta f = f''$ is the second derivative.
It is diagonal in Fourier space: $\widehat{\Delta f}(k) = -k^2 \hat{f}(k)$, where $k \in \mathbb{R}$. From
$\widehat{\Delta f}(k) = -k^2 \hat{f}(k) = \hat{\rho}(k)$ we get $\hat{f}(k) = -(1/k^2) \hat{\rho}(k)$, so that $f = V \star \rho$,
where V is the function which has the Fourier transform $\hat{V}(k) = -1/k^2$.
But $V(x) = |x|/2$ has this Fourier transform:
$$\int_{-\infty}^{\infty} \frac{|x|}{2} e^{-ikx}\, dx = -\frac{1}{k^2} .$$


Also for $N = \mathbb{T}$, the Laplacian $\Delta f = f''$ is diagonal in Fourier space. It
is the $2\pi$-periodic function $V(x) = x(2\pi - x)/(4\pi)$, which has the Fourier
series $\hat{V}(k) = -1/k^2$.


For general $N = \mathbb{R}^n$, see for example [58].


Remark. The function $G_y(x) = V(x - y)$ is also called the Green function
of the Laplacian. Because Newton potentials V are not smooth, establishing
global existence for the Vlasov dynamics is not easy but it has been done
in many cases [30]. The potential |x| models galaxy motion and appears in
plasma dynamics [90, 65, 82].


Lemma 5.4.2. (Gronwall) If a function u satisfies $u'(t) \leq |g(t)| u(t)$ for all
$0 \leq t \leq T$, then $u(t) \leq u(0) \exp(\int_0^t |g(s)|\, ds)$ for $0 \leq t \leq T$.


Proof. Integrating the assumption gives $u(t) \leq u(0) + \int_0^t |g(s)| u(s)\, ds$. The
function h(t) satisfying the differential equation $h'(t) = |g(t)| u(t)$ with
$h(0) = u(0)$ satisfies $u(t) \leq h(t)$ and therefore $h'(t) \leq |g(t)| h(t)$. This leads to
$h(t) \leq h(0) \exp(\int_0^t |g(s)|\, ds)$ so that $u(t) \leq u(0) \exp(\int_0^t |g(s)|\, ds)$.
This proof for real valued functions [20] generalizes
to the case, where $u^t(x)$ evolves in a function space. One just can apply the
same proof for any fixed x. □
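
The Gronwall bound can be observed numerically. The sketch below integrates, with explicit Euler steps, the test equation $u' = g(t) u - u^2$ with $g(t) = \sin(3t)$, which satisfies $u' \leq |g(t)| u$ as long as $u \geq 0$, and compares u with $u(0) \exp(\int_0^t |g(s)|\, ds)$. The test equation, the step size and the names are assumptions made for illustration.

```python
import math

def g(t):
    return math.sin(3 * t)

def check_gronwall(u0=1.0, T=5.0, steps=50_000):
    """Euler steps for u' = g(t) u - u^2, which satisfies u' <= |g(t)| u for u >= 0."""
    h = T / steps
    u, integral = u0, 0.0
    worst_ratio = 0.0
    for i in range(steps):
        t = i * h
        u += h * (g(t) * u - u * u)
        integral += h * abs(g(t))          # left Riemann sum of |g|
        bound = u0 * math.exp(integral)
        worst_ratio = max(worst_ratio, u / bound)
    return worst_ratio

ratio = check_gronwall()
print(ratio)  # largest observed value of u(t) / bound(t)
```

A ratio that never exceeds 1 means the computed solution stays below the exponential bound on the whole interval.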


Theorem 5.4.3 (Batt-Neunzert-Brown-Hepp-Dobrushin). If $\nabla_x V$ is
bounded and globally Lipschitz continuous, then the Hamiltonian Vlasov
flow has a unique global solution $X^t$ and consequently, the Vlasov equation
has a unique and global solution $P^t$ in the space of measures. If V and
$P^0$ are smooth, then $P^t$ is piecewise smooth.


Proof. The Hamiltonian differential equation for X = (f, g) evolves on
the complete metric space of all continuous maps from M to N. The distance
is $d(X, Y) = \sup_{\omega \in M} d(X(\omega), Y(\omega))$, where d is the distance in N.


We have to show that the differential equation $\dot{f} = g$ and $\dot{g} = G(f) =
-\int_M \nabla_x V(f(\omega) - f(\eta))\, dm(\eta)$ in C(M, N) has a unique solution: because
of Lipschitz continuity
$$\|G(f) - G(f')\|_\infty \leq 2 \|D(\nabla_x V)\|_\infty \cdot \|f - f'\|_\infty$$
the standard Picard existence theorem for differential equations assures
local existence of solutions.


Gronwall's lemma assures that $\|X(\omega)\|$ can not grow faster than exponentially.
This gives the global existence. □


Remark. If m is a point measure supported on finitely many points, then
one could also use the global existence theorem for differential equations.
For smooth potentials, the dynamics depends continuously on the measure
m. One could approximate a smooth measure m by point measures.


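Definition. The evolution of $DX^t$ at a point $\omega \in M$ is called the linearized
Vlasov flow. It is the differential equation
$$D\ddot{f}(\omega) = -\int_M \nabla^2 V(f(\omega) - f(\eta))\, dm(\eta)\, Df(\omega) =: B(f^t) Df(\omega)$$
and we can write it as a first order differential equation
$$\frac{d}{dt} DX = \frac{d}{dt} \begin{pmatrix} Df \\ Dg \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ -\int_M \nabla^2 V(f(\omega) - f(\eta))\, dm(\eta) & 0 \end{pmatrix} \begin{pmatrix} Df \\ Dg \end{pmatrix} = A(f^t) \begin{pmatrix} Df \\ Dg \end{pmatrix} .$$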


Remark. The rank of the matrix $DX^t(\omega)$ stays constant. $Df^t(\omega)$ is a linear
combination of $Df^0(\omega)$ and $Dg^0(\omega)$. Critical points of $f^t$ can only
appear for $\omega$, where $Df^0(\omega)$, $Dg^0(\omega)$ are linearly dependent. More generally
$Y_k(t) = \{\omega \in M \mid DX^t(\omega) \text{ has rank } 2q - k = \dim(N) - k\}$ is time
independent. The set $Y_k$ contains $\{\omega \mid D(f)(\omega) = \lambda D(g)(\omega), \lambda \in \mathbb{R} \cup \{\infty\}\}$.

Definition. The random variable
$$\lambda(\omega) = \limsup_{t \to \infty} \frac{1}{t} \log(\|D(X^t(\omega))\|) \in [0, \infty]$$
is called the maximal Lyapunov exponent of the SL(2q, R)-cocycle $A = A(f^t)$
along an orbit $X^t = (f^t, g^t)$ of the Vlasov flow. The Lyapunov exponent
could be infinite. Differentiation of $D\ddot{f} = B(f^t) Df$ at a critical point
$\omega^t$ gives $D^2 \ddot{f}^t(\omega^t) = B(f^t) D^2 f^t(\omega^t)$. The eigenvalues $\lambda_j$ of the Hessian
$D^2 f$ satisfy $\ddot{\lambda}_j = B(f^t) \lambda_j$.


Definition. Time independent solutions of the Vlasov equation are called 
equilibrium measures or stationary solutions. 


Definition. One can construct some of them with a Maxwellian ansatz
$$P(x, y) = C \exp\Big(-\beta\Big(\frac{y^2}{2} + \int V(x - x') Q(x')\, dx'\Big)\Big) = S(y) Q(x) .$$
The constant C is chosen such that $\int_{\mathbb{R}^q} S(y)\, dy = 1$. These measures are
called Bernstein-Green-Kruskal (BGK) modes.


Proposition 5.4.4. If $Q : N \to \mathbb{R}$ satisfies the integral equation
$$Q(x) = \exp\Big(-\int \beta V(x - x') Q(x')\, dx'\Big) = \exp(-\beta V \star Q(x))$$
then the Maxwellian distribution P(x, y) = S(y)Q(x) is an equilibrium
solution of the Vlasov equation to the potential V.


Proof.
$$y \nabla_x P = y S(y) Q_x(x) = y S(y) \Big(-\beta Q(x) \int \nabla_x V(x - x') Q(x')\, dx'\Big)$$
and
$$\int_N \nabla_x V(x - x') \nabla_y P(x, y) P(x', y')\, dx' dy' = Q(x) (-\beta S(y) y) \int \nabla_x V(x - x') Q(x')\, dx'$$
gives $y \nabla_x P(x, y) = \int_N \nabla_x V(x - x') \nabla_y P(x, y) P(x', y')\, dx' dy'$. □


5.5 Multidimensional distributions


Random variables which are vector-valued can be treated in an analogous
way as random variables. One often adds the term "multivariate" to indicate
that one has multiple dimensions.


Definition. A random vector is a vector-valued random variable. It is in
$L^p$ if each coordinate is in $L^p$. The expectation E[X] of a random vector
$X = (X_1, \dots, X_d)$ is the vector $(E[X_1], \dots, E[X_d])$, the variance is the
vector $(\mathrm{Var}[X_1], \dots, \mathrm{Var}[X_d])$.


Example. The random vector $X = (x^3, y^4, z^5)$ on the unit cube $\Omega = [0, 1]^3$
with Lebesgue measure P has the expectation E[X] = (1/4, 1/5, 1/6).


Definition. Assume $X = (X_1, \dots, X_d)$ is a random vector in $L^\infty$. The law
of the random vector X is a measure $\mu$ on $\mathbb{R}^d$ with compact support. After
some scaling and translation we can assume that $\mu$ is a bounded Borel
measure on the unit cube $I^d = [0, 1]^d$.


Definition. The multi-dimensional distribution function of a random vector
$X = (X_1, \dots, X_d)$ is defined as
$$F_X(t) = F_{(X_1, \dots, X_d)}(t_1, \dots, t_d) = P[X_1 \leq t_1, \dots, X_d \leq t_d] .$$
For a continuous random variable, there is a density $f_X(t)$ satisfying
$$F_X(t) = \int_{-\infty}^{t_1} \cdots \int_{-\infty}^{t_d} f(s_1, \dots, s_d)\, ds_1 \cdots ds_d .$$
The multi-dimensional distribution function is also called multivariate
distribution function.


Definition. We use in this section the multi-index notation $x^n = \prod_{i=1}^d x_i^{n_i}$.
Denote by $\mu_n = \int_{I^d} x^n\, d\mu$ the n'th moment of $\mu$. If X is a random
vector, with law $\mu$, call $\mu_n(X)$ the n'th moment of X. It is equal to
$E[X^n] = E[X_1^{n_1} X_2^{n_2} \cdots X_d^{n_d}]$. We call the map $n \in \mathbb{N}^d \mapsto \mu_n$ the moment
configuration or, if d = 1, the moment sequence. We will tacitly assume
$\mu_n = 0$, if at least one coordinate $n_i$ in $n = (n_1, \dots, n_d)$ is negative.


If X is a continuous random vector, the moments satisfy
$$\mu_n(X) = \int_{\mathbb{R}^d} x^n f(x)\, dx$$
which is a short hand notation for
$$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} x_1^{n_1} \cdots x_d^{n_d} f(x_1, \dots, x_d)\, dx_1 \cdots dx_d .$$


Example. The n = (7, 3, 4)'th moment of the random vector $X = (x^3, y^4, z^5)$
is
$$E[X_1^{n_1} X_2^{n_2} X_3^{n_3}] = E[x^{21} y^{12} z^{20}] = \frac{1}{22 \cdot 13 \cdot 21} .$$
The random vector X is continuous and has the probability density
$$f(x, y, z) = \Big(\frac{x^{-2/3}}{3}\Big) \Big(\frac{y^{-3/4}}{4}\Big) \Big(\frac{z^{-4/5}}{5}\Big) .$$
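
The moments in this example can be computed exactly, because the coordinates are independent and $E[x^a] = 1/(a+1)$ for x uniform on [0, 1]. The following is a minimal sketch with exact rational arithmetic; the helper name `moment` and its signature are assumptions for illustration.

```python
from fractions import Fraction

def moment(powers, n):
    """E[X^n] for X = (x_1^{p_1}, ..., x_d^{p_d}) with x uniform on the unit cube."""
    result = Fraction(1)
    for p_i, n_i in zip(powers, n):
        # E[x^(p_i * n_i)] = 1 / (p_i * n_i + 1) for x uniform on [0, 1]
        result *= Fraction(1, p_i * n_i + 1)
    return result

powers = (3, 4, 5)                      # X = (x^3, y^4, z^5)
print(moment(powers, (7, 3, 4)))        # 1/6006
# The unit multi-indices give the coordinates of the expectation E[X].
expectation = [moment(powers, e) for e in ((1, 0, 0), (0, 1, 0), (0, 0, 1))]
print(expectation)
```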


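Remark. As in one dimension, one can define a multidimensional moment
generating function
$$M_X(t) = E[e^{t \cdot X}] = E[e^{t_1 X_1} e^{t_2 X_2} \cdots e^{t_d X_d}]$$
which contains all the information about the moments because of the multi-
dimensional moment formula
$$E[X^n] = \frac{d^n}{dt^n} M_X(t) \Big|_{t=0} ,$$
where the n'th derivative is defined as
$$\frac{d^n}{dt^n} f(x_1, \dots, x_d) = \frac{\partial^{n_1}}{\partial x_1^{n_1}} \frac{\partial^{n_2}}{\partial x_2^{n_2}} \cdots \frac{\partial^{n_d}}{\partial x_d^{n_d}} f(x_1, \dots, x_d) .$$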

Example. The random variable $X = (x, \sqrt{y}, z^{1/3})$ has the moment generating
function
$$M(s, t, u) = \int_0^1 \int_0^1 \int_0^1 e^{sx + t\sqrt{y} + u z^{1/3}}\, dx dy dz = \frac{e^s - 1}{s} \cdot \frac{2 + 2e^t(t - 1)}{t^2} \cdot \frac{-6 + 3e^u(2 - 2u + u^2)}{u^3} .$$
Because the components $X_1, X_2, X_3$ in this example were independent random
variables, the moment generating function is of the form
$$M(s) M(t) M(u) ,$$
where the factors are the one-dimensional moment generating functions of the
one-dimensional random variables $X_1$, $X_2$ and $X_3$.
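
The one-dimensional factors of this moment generating function can be cross-checked by quadrature. The sketch below compares the closed forms $(2 + 2e^t(t-1))/t^2$ for $\sqrt{y}$ and $(-6 + 3e^u(2 - 2u + u^2))/u^3$ for $z^{1/3}$ with a midpoint rule on [0, 1]. The quadrature routine, its resolution and the test value 0.8 are assumptions for illustration.

```python
import math

def midpoint(func, steps=100_000):
    """Midpoint rule for the integral of func over [0, 1]."""
    h = 1.0 / steps
    return sum(func((i + 0.5) * h) for i in range(steps)) * h

def mgf_sqrt_closed(t):
    """Closed form of E[exp(t sqrt(y))] for y uniform on [0, 1]."""
    return (2 + 2 * math.exp(t) * (t - 1)) / t**2

def mgf_cbrt_closed(u):
    """Closed form of E[exp(u z^(1/3))] for z uniform on [0, 1]."""
    return (-6 + 3 * math.exp(u) * (2 - 2 * u + u * u)) / u**3

value = 0.8
sqrt_numeric = midpoint(lambda y: math.exp(value * math.sqrt(y)))
cbrt_numeric = midpoint(lambda z: math.exp(value * z ** (1.0 / 3.0)))
print(sqrt_numeric, mgf_sqrt_closed(value))
print(cbrt_numeric, mgf_cbrt_closed(value))
```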


Definition. Let $e_i$ be the standard basis in $\mathbb{Z}^d$. Define the partial difference
$(\Delta_i a)_n = a_{n - e_i} - a_n$ on configurations and write $\Delta^k = \prod_i \Delta_i^{k_i}$. Unlike the
usual convention, we take a particular sign convention for $\Delta$. This allows
us to avoid many negative signs in this section. By induction in $\sum_{i=1}^d n_i$,
one proves the relation
$$(\Delta^k \mu)_n = \int_{I^d} x^{n-k} (1 - x)^k\, d\mu(x) \qquad (5.1)$$
using $x^{n - e_i - k}(1 - x)^k - x^{n-k}(1 - x)^k = x^{n - e_i - k}(1 - x)^{k + e_i}$. To improve
readability, we also use notation like $\binom{n}{k} = \prod_{i=1}^d \binom{n_i}{k_i}$ or
$\sum_{k=0}^{n} = \sum_{k_1=0}^{n_1} \cdots \sum_{k_d=0}^{n_d}$. We mean $n \to \infty$ in the sense that $n_i \to \infty$
for all $i = 1, \dots, d$.


Definition. Given a continuous function $f : I^d \to \mathbb{R}$. For $n \in \mathbb{N}^d$, $n_i > 0$ we
define the higher dimensional Bernstein polynomials
$$B_n(f)(x) = \sum_{k=0}^{n} f\Big(\frac{k_1}{n_1}, \dots, \frac{k_d}{n_d}\Big) \binom{n}{k} x^k (1 - x)^{n-k} .$$


Lemma 5.5.1. (Multidimensional Bernstein) In the uniform topology in
$C(I^d)$, we have $B_n(f) \to f$ if $n \to \infty$.


Proof. By the Weierstrass theorem, multi-dimensional polynomials are dense
in $C(I^d)$ as they separate points in $I^d$. It is therefore enough to prove
the claim for $f(x) = x^m = \prod_{i=1}^d x_i^{m_i}$. Because $B_n(y^m)(x)$ is the product of
one dimensional Bernstein polynomials
$$B_n(y^m)(x) = \prod_{i=1}^d B_{n_i}(y_i^{m_i})(x_i) ,$$
the claim follows from the result corollary (2.6.2) in one dimension. □
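
The convergence and the product structure used in the proof can be observed numerically. The sketch below evaluates the two-dimensional Bernstein polynomial of $f(x, y) = x^2 y$ at one point. The test function, the evaluation point and the name `bernstein_2d` are assumptions for illustration. For this f the one-dimensional identities $B_n(x^2)(x) = x^2 + x(1-x)/n$ and $B_n(y)(y) = y$ give the error $x(1-x)y/n$ explicitly.

```python
from math import comb

def bernstein_2d(f, n1, n2, x, y):
    """B_n(f)(x, y) for n = (n1, n2): sum of f(k1/n1, k2/n2) times binomial weights."""
    total = 0.0
    for k1 in range(n1 + 1):
        w1 = comb(n1, k1) * x**k1 * (1 - x) ** (n1 - k1)
        for k2 in range(n2 + 1):
            w2 = comb(n2, k2) * y**k2 * (1 - y) ** (n2 - k2)
            total += f(k1 / n1, k2 / n2) * w1 * w2
    return total

def f(x, y):
    return x * x * y

# Approximation error at the point (0.3, 0.7) for growing n = (n, n).
errors = {n: abs(bernstein_2d(f, n, n, 0.3, 0.7) - f(0.3, 0.7)) for n in (5, 20, 80)}
print(errors)  # the error shrinks like 1/n
```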


Remark. Hildebrandt and Schoenberg refer for the proof of lemma (5.5.1)
to Bernstein's proof in one dimension. While a higher dimensional adaptation
of the probabilistic proof could be done involving a stochastic process
in $\mathbb{Z}^d$ with drift $x_i$ in the i'th direction, the factorization argument is more
elegant.


Theorem 5.5.2 (Hausdorff, Hildebrandt-Schoenberg). There is a bijection
between signed bounded Borel measures $\mu$ on $[0, 1]^d$ and configurations $\mu_n$
for which there exists a constant C such that
$$\sum_{k=0}^{n} \Big| \binom{n}{k} (\Delta^k \mu)_n \Big| \leq C, \quad \forall n \in \mathbb{N}^d . \qquad (5.2)$$
A configuration $\mu_n$ belongs to a positive measure if and only if additionally
to (5.2) one has $(\Delta^k \mu)_n \geq 0$ for all $k, n \in \mathbb{N}^d$.


Proof. (i) Because by lemma (5.5.1), polynomials are dense in $C(I^d)$, there
exists a unique solution to the moment problem. We show now existence
of a measure $\mu$ under condition (5.2). For a measure $\mu$, define for $n \in \mathbb{N}^d$
the atomic measures $\mu^{(n)}$ on $I^d$ which have weights $\binom{n}{k} (\Delta^k \mu)_n$ on the
$\prod_{i=1}^d (n_i + 1)$ points $(\frac{n_1 - k_1}{n_1}, \dots, \frac{n_d - k_d}{n_d}) \in I^d$ with $0 \leq k_i \leq n_i$. Because
$$\int_{I^d} x^m\, d\mu^{(n)}(x) = \sum_{k=0}^{n} \Big(\frac{n - k}{n}\Big)^m \binom{n}{k} (\Delta^k \mu)_n$$
$$= \int_{I^d} \sum_{k=0}^{n} \binom{n}{k} \Big(\frac{n - k}{n}\Big)^m x^{n-k} (1 - x)^k\, d\mu(x)$$
$$= \int_{I^d} \sum_{k=0}^{n} \binom{n}{k} \Big(\frac{k}{n}\Big)^m x^k (1 - x)^{n-k}\, d\mu(x)$$
$$= \int_{I^d} B_n(y^m)(x)\, d\mu(x) \to \int_{I^d} x^m\, d\mu(x) ,$$
we know that any signed measure $\mu$ which is an accumulation point of $\mu^{(n)}$,
where $n_i \to \infty$, solves the moment problem. The condition (5.2) implies that
the variation of the measures $\mu^{(n)}$ is bounded. By Alaoglu's theorem, there
exists an accumulation point $\mu$.


(ii) The left hand side of (5.2) is the variation $\|\mu^{(n)}\|$ of the measure $\mu^{(n)}$.
Because by (i) $\mu^{(n)} \to \mu$, and $\mu$ has finite variation, there exists a constant
C such that $\|\mu^{(n)}\| \leq C$ for all n. This establishes (5.2).


(iii) We see that if $(\Delta^k \mu)_n \geq 0$ for all k, then the measures $\mu^{(n)}$ are all
positive and therefore also the measure $\mu$.


(iv) If $\mu$ is a positive measure, then by (5.1)
$$\binom{n}{k} (\Delta^k \mu)_n = \binom{n}{k} \int_{I^d} x^{n-k} (1 - x)^k\, d\mu(x) \geq 0 .$$
□


Remark. Hildebrandt and Schoenberg noted in 1933, that this result gives
a constructive proof of the Riesz representation theorem stating that the
dual of $C(I^d)$ is the space of Borel measures $M(I^d)$.


Definition. Let $\delta(x)$ denote the Dirac point measure located on $x \in I^d$. It
satisfies $\int_{I^d} y\, d\delta(x)(y) = x$.


We extract from the proof of theorem (5.5.2) the construction: 


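Corollary 5.5.3. An explicit finite constructive approximation of a given
measure $\mu$ on $I^d$ is given for $n \in \mathbb{N}^d$ by the atomic measures
$$\mu^{(n)} = \sum_{0 \leq k_i \leq n_i} \binom{n}{k} (\Delta^k \mu)_n\, \delta\Big(\Big(\frac{n_1 - k_1}{n_1}, \dots, \frac{n_d - k_d}{n_d}\Big)\Big) .$$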
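
In one dimension this construction is easy to carry out from the moments alone. The sketch below uses the convention $(\Delta a)_n = a_{n-1} - a_n$, so that $(\Delta^k \mu)_n = \sum_j (-1)^j \binom{k}{j} \mu_{n-k+j}$, and builds the atoms of $\mu^{(n)}$ for the measure with density 2x on [0, 1], whose moments are $\mu_n = 2/(n+2)$. The example measure and the function names are assumptions for illustration.

```python
from fractions import Fraction
from math import comb

def delta_k(moment, k, n):
    """(Delta^k mu)_n in one dimension for the convention (Delta a)_n = a_{n-1} - a_n."""
    return sum((-1) ** j * comb(k, j) * moment(n - k + j) for j in range(k + 1))

def atomic_approximation(moment, n):
    """Points (n - k)/n and weights C(n, k) (Delta^k mu)_n for k = 0, ..., n."""
    return [(Fraction(n - k, n), comb(n, k) * delta_k(moment, k, n)) for k in range(n + 1)]

# Moments of the measure with density 2x on [0, 1]: mu_n = 2 / (n + 2).
def moment(n):
    return Fraction(2, n + 2)

atoms = atomic_approximation(moment, 20)
total_mass = sum(w for _, w in atoms)
first_moment = sum(x * w for x, w in atoms)
print(total_mass, first_moment)  # 1 2/3
```

All weights come out positive, in line with the positivity criterion of theorem (5.5.2), and the total mass and first moment of the atomic measure already agree exactly with those of the original measure.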
Hausdorff established a criterion for absolute continuity of a measure $\mu$
with respect to the Lebesgue measure on [0, 1] [72]. This can be generalized
to find a criterion for comparing two arbitrary measures and works in d
dimensions.


Definition. As usual, we call a measure $\mu$ on $I^d$ uniformly absolutely continuous
with respect to $\nu$, if it satisfies $\mu = f\, d\nu$ with $f \in L^\infty(I^d)$.


Corollary 5.5.4. A positive probability measure $\mu$ is uniformly absolutely
continuous with respect to a second probability measure $\nu$ if and only if
there exists a constant C such that $(\Delta^k \mu)_n \leq C \cdot (\Delta^k \nu)_n$ for all $k, n \in \mathbb{N}^d$.


Proof. If $\mu = f \nu$ with $f \in L^\infty(I^d)$, we get using (5.1)
$$(\Delta^k \mu)_n = \int_{I^d} x^{n-k} (1 - x)^k\, d\mu(x) = \int_{I^d} x^{n-k} (1 - x)^k f\, d\nu(x) \leq \|f\|_\infty \int_{I^d} x^{n-k} (1 - x)^k\, d\nu(x) = \|f\|_\infty (\Delta^k \nu)_n .$$
On the other hand, if $(\Delta^k \mu)_n \leq C (\Delta^k \nu)_n$ then $\rho_n = C \nu_n - \mu_n$ satisfies
$(\Delta^k \rho)_n = C (\Delta^k \nu)_n - (\Delta^k \mu)_n \geq 0$ and
defines by theorem (5.5.2) a positive measure $\rho$ on $I^d$. Since $\rho = C\nu - \mu$,
we have for any Borel set $A \subset I^d$ that $\rho(A) \geq 0$. This gives $\mu(A) \leq C \nu(A)$ and
implies that $\mu$ is absolutely continuous with respect to $\nu$ with a function f
satisfying $f(x) \leq C$ almost everywhere. □


This leads to a higher dimensional generalization of Hausdorff's result
which allows to characterize the continuity of a multidimensional random
vector from its moments:


Corollary 5.5.5. A Borel probability measure $\mu$ on $I^d$ is uniformly absolutely
continuous with respect to Lebesgue measure on $I^d$ if and only if there is a
constant C such that
$|\Delta^k \mu_n| \leq C \binom{n}{k}^{-1} \prod_{i=1}^d (n_i + 1)^{-1}$ for all k and n.


Proof. Use corollary (5.5.4)
and $\int_{I^d} x^{n-k} (1 - x)^k\, dx = \prod_{i=1}^d \binom{n_i}{k_i}^{-1} \prod_{i=1}^d (n_i + 1)^{-1}$. □


There is also a characterization of Hausdorff of $L^p$ measures on $I^1 = [0, 1]$
for $p \geq 2$. This has an obvious generalization to d dimensions:


Proposition 5.5.6. Given a bounded positive probability measure $\mu \in M(I^d)$
and assume $1 < p < \infty$. Then $\mu \in L^p(I^d)$ if and only if there
exists a constant C such that for all n
$$(n + 1)^{p-1} \sum_{k=0}^{n} \Big( (\Delta^k \mu)_n \binom{n}{k} \Big)^p \leq C . \qquad (5.3)$$


Proof. (i) Let $\mu^{(n)}$ be the measures of corollary (5.5.3). We construct first
from the atomic measures $\mu^{(n)}$ absolutely continuous measures $\tilde{\mu}^{(n)} =
g^{(n)} dx$ on $I^d$ given by a function $g^{(n)}$ which takes the constant value
$$|(\Delta^k \mu)_n| \binom{n}{k} \prod_{i=1}^d (n_i + 1)$$
on a cube of side lengths $1/(n_i + 1)$ centered at the point $(n - k)/n \in I^d$.
Because the cube has Lebesgue volume $(n + 1)^{-1} = \prod_{i=1}^d (n_i + 1)^{-1}$, it has
the same measure with respect to both $\mu^{(n)}$ and $g^{(n)} dx$. We have therefore
also $g^{(n)} dx \to \mu$ weakly.


(ii) Assume $\mu = f dx$ with $f \in L^p$. Because $g^{(n)} dx \to f dx$ in the weak
topology for measures, we have $g^{(n)} \to f$ weakly in $L^p$. But then, there
exists a constant C such that $\|g^{(n)}\|_p \leq C$ and this is equivalent to (5.3).


(iii) On the other hand, assumption (5.3) means that $\|g^{(n)}\|_p \leq C$, where
$g^{(n)}$ was constructed in (i). Since the unit-ball in the reflexive Banach space
$L^p(I^d)$ is weakly compact for $p \in (1, \infty)$, a subsequence of $g^{(n)}$ converges to
a function $g \in L^p$. This implies that a subsequence of $g^{(n)} dx$ converges as
a measure to $g dx$ which is in $L^p$ and which is equal to $\mu$ by the uniqueness
of the moment problem (Weierstrass). □


5.6 Poisson processes


Definition. A Poisson process $(S, P, \Pi, N)$ over a probability space $(\Omega, \mathcal{F}, Q)$
is given by a complete metric space S, a non-atomic finite Borel measure
P on S and a function $\omega \mapsto \Pi(\omega) \subset S$ from $\Omega$ to the set of finite subsets of
S such that for every measurable set $B \subset S$, the map
$$\omega \mapsto N_B(\omega) = \frac{P[S]}{|\Pi(\omega)|} |\Pi(\omega) \cap B|$$
is a Poisson distributed random variable with parameter P[B]. For any
finite partition $\{B_i\}_{i=1}^n$ of S, the set of random variables $\{N_{B_i}\}_{i=1}^n$ have to
be independent. The measure P is called the mean measure of the process.
Here |A| denotes the cardinality of a finite set A. It is understood that
$N_B(\omega) = 0$ if $\omega \in S^0 = \{0\}$.


Example. We have encountered the one-dimensional Poisson process in
the last chapter as a martingale. We started with IID exponentially distributed
random variables $X_k$, which are "waiting times" and defined $N_t(\omega) =
\sum_{k=1}^{\infty} 1_{S_k(\omega) \leq t}$. Let's translate this into the current framework. The set S
is [0, t] with Lebesgue measure P as mean measure. The set $\Pi(\omega)$ is the
discrete point set $\Pi(\omega) = \{S_n(\omega) \mid n = 1, 2, 3, \dots\} \cap S$. For every Borel set
B in S, we have
$$N_B(\omega) = t\, \frac{|\Pi(\omega) \cap B|}{|\Pi(\omega)|} .$$


Remark. The Poisson process is an example of a point process, because
we can see it as assigning a random point set $\Pi(\omega)$ on S which has density
P on S. If S is part of the Euclidean space and the mean measure P is
continuous $P = f dx$, then the interpretation is that f(x) is the average
density of points at x.


Figure. A Poisson process in $\mathbb{R}^2$ with mean density
$$P = \frac{e^{-x^2 - y^2}}{\pi}\, dx dy .$$


Theorem 5.6.1 (Existence of Poisson processes). For every non-atomic mea- 
sure P on S, there exists a Poisson process. 


Proof. Define $\Omega = \bigcup_{d=0}^{\infty} S^d$, where $S^d = S \times \cdots \times S$ is the Cartesian product
and $S^0 = \{0\}$. Let $\mathcal{F}$ be the Borel $\sigma$-algebra on $\Omega$. The probability measure
Q restricted to $S^d$ is the product measure $(P \times P \times \cdots \times P) \cdot Q[N_S = d]$, where
$Q[N_S = d] = Q[S^d] = e^{-P[S]} (d!)^{-1} P[S]^d$. Define $\Pi(\omega) = \{\omega_1, \dots, \omega_d\}$ if
$\omega \in S^d$ and $N_B$ as above. One readily checks that $(S, P, \Pi, N)$ is a Poisson
process on the probability space $(\Omega, \mathcal{F}, Q)$: For any measurable partition
$\{B_j\}_{j=0}^m$ of S, we have
$$Q\Big[N_{B_1} = d_1, \dots, N_{B_m} = d_m \;\Big|\; N_S = d_0 + \sum_{j=1}^m d_j = d\Big] = \frac{d!}{d_0! \cdots d_m!} \prod_{j=0}^m \frac{P[B_j]^{d_j}}{P[S]^{d_j}}$$
so that the independence of $\{N_{B_j}\}_{j=1}^m$ follows:
$$Q[N_{B_1} = d_1, \dots, N_{B_m} = d_m] = \sum_{d = d_1 + \cdots + d_m}^{\infty} Q[N_S = d]\; Q[N_{B_1} = d_1, \dots, N_{B_m} = d_m \mid N_S = d]$$
$$= \sum_{d = d_1 + \cdots + d_m}^{\infty} \frac{e^{-P[S]}}{d!} \frac{d!}{d_0! \cdots d_m!} \prod_{j=0}^m P[B_j]^{d_j}$$
$$= \Big[ \sum_{d_0 = 0}^{\infty} \frac{e^{-P[B_0]} P[B_0]^{d_0}}{d_0!} \Big] \prod_{j=1}^m \frac{e^{-P[B_j]} P[B_j]^{d_j}}{d_j!}$$
$$= \prod_{j=1}^m \frac{e^{-P[B_j]} P[B_j]^{d_j}}{d_j!}$$
$$= \prod_{j=1}^m Q[N_{B_j} = d_j] .$$
This calculation in the case m = 1, leaving away the last step shows that $N_B$
is Poisson distributed with parameter P[B]. The last step in the calculation
is then justified. □
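
The construction in this proof translates directly into a sampling recipe: draw the number of points d from a Poisson distribution with parameter P[S], then place d independent points with law P/P[S]. The sketch below does this for S = [0, 1] with P equal to 5 times Lebesgue measure and counts the points $|\Pi(\omega) \cap B|$ in two disjoint sets. The choice of S and P, the Poisson sampler (Knuth's product method) and all names are assumptions for illustration.

```python
import math
import random

def sample_poisson(lam, rng):
    """Knuth's product method for a Poisson(lam) variate (adequate for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def sample_point_set(total_mass, rng):
    """Draw d ~ Poisson(P[S]), then d independent points with law P / P[S] on S = [0, 1]."""
    d = sample_poisson(total_mass, rng)
    return [rng.random() for _ in range(d)]

rng = random.Random(2006)
total_mass, trials = 5.0, 20_000          # P = 5 * Lebesgue measure on [0, 1]
counts_a, counts_b = [], []
for _ in range(trials):
    points = sample_point_set(total_mass, rng)
    counts_a.append(sum(1 for x in points if x < 0.3))    # B_1 = [0, 0.3)
    counts_b.append(sum(1 for x in points if x >= 0.3))   # B_2 = [0.3, 1]

mean_a = sum(counts_a) / trials
mean_b = sum(counts_b) / trials
cov_ab = sum(a * b for a, b in zip(counts_a, counts_b)) / trials - mean_a * mean_b
print(mean_a, mean_b, cov_ab)  # near P[B_1] = 1.5, P[B_2] = 3.5 and 0
```

The empirical means approximate the mean measure of the two sets, and the vanishing sample covariance is consistent with the independence shown in the proof.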


Remark. The random discrete measure $P(\omega)[B] = N_B(\omega)$ is a normalized
counting measure on S with support on $\Pi(\omega)$. The expectation of
the random measure $P(\omega)$ is the measure $\tilde{P}$ on S defined by $\tilde{P}[B] =
\int_\Omega P(\omega)[B]\, dQ(\omega)$. But this measure is just P:


Lemma 5.6.2. $P = \int_\Omega P(\omega)\, dQ(\omega) = \tilde{P}$.


Proof. Because the Poisson distributed random variable $N_B(\omega) = P(\omega)[B]$
has by assumption the Q-expectation $P[B] = \sum_{k=0}^{\infty} k\, Q[N_B = k] =
\int_\Omega P(\omega)[B]\, dQ(\omega)$ one gets $P = \int_\Omega P(\omega)\, dQ(\omega) = \tilde{P}$. □


Remark. The existence of Poisson processes can also be established by
assigning to a basis $\{e_i\}$ of the Hilbert space $L^2(S, P)$ some independent
Poisson-distributed random variables $Z_i = \phi(e_i)$ and define then a map
$\phi(f) = \sum_i a_i \phi(e_i)$ if $f = \sum_i a_i e_i$. The image of this map is a Hilbert
space of random variables with dot product $\mathrm{Cov}[\phi(f), \phi(g)] = (f, g)$. Define
$N_B = \phi(1_B)$. These random variables have the correct distribution and are
uncorrelated for disjoint sets $B_j$.


Definition. A point process is a map $\Pi$ from a probability space $(\Omega, \mathcal{F}, Q)$ to
the set of finite subsets of a probability space $(S, \mathcal{B}, P)$ such that $N_B(\omega) :=
|\Pi(\omega) \cap B|$ is a random variable for all measurable sets $B \in \mathcal{B}$.

Definition. Assume $\Pi$ is a point process on $(S, \mathcal{B}, P)$. For a function
$f : S \to \mathbb{R}^+$ in $L^1(S, P)$, define the random variable
$$\Sigma_f(\omega) = \sum_{z \in \Pi(\omega)} f(z) .$$
Example. For a Poisson process and $f = 1_B$, one gets $\Sigma_f(\omega) = N_B(\omega)$.


Definition. The moment generating function of $\Sigma_f$ is defined as for any
random variable as
$$M_{\Sigma_f}(t) = E[e^{t \Sigma_f}] .$$
It is called the characteristic functional of the point process.


Example. For a Poisson process and $f = a 1_B$, the moment generating
function of $\Sigma_f(\omega) = a N_B(\omega)$ is $E[e^{a t N_B}] = e^{-P[B](1 - e^{at})}$. We have computed
the moment generating function of a Poisson distributed random variable
in the first chapter.
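
This formula can be verified directly from the Poisson point masses $Q[N = k] = e^{-\lambda} \lambda^k / k!$. The sketch below sums the series for $E[e^{sN}]$ with s = at and compares it with $\exp(-\lambda(1 - e^{s}))$, where $\lambda$ stands for P[B]. The parameter values and the truncation at 200 terms are assumptions for illustration.

```python
import math

def poisson_mgf_series(lam, s, terms=200):
    """E[e^{s N}] for N ~ Poisson(lam), summed directly from the point masses."""
    total, weight = 0.0, math.exp(-lam)      # weight = Q[N = k], starting at k = 0
    for k in range(terms):
        total += math.exp(s * k) * weight
        weight *= lam / (k + 1)
    return total

lam, a, t = 2.5, 0.7, 0.4                    # lam plays the role of P[B]
series = poisson_mgf_series(lam, a * t)
closed = math.exp(-lam * (1 - math.exp(a * t)))
print(series, closed)  # agree to rounding error
```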


Example. For a Poisson process and $f = \sum_{j=1}^n a_j 1_{B_j}$, where $B_j$ are disjoint
sets, we have the characteristic functional
$$\prod_{j=1}^n E[e^{a_j t N_{B_j}}] = e^{-\sum_{j=1}^n P[B_j](1 - e^{a_j t})} .$$


Example. For a Poisson process, and $f \in L^1(S, P)$, the moment generating
function of $\Sigma_f$ is
$$M_{\Sigma_f}(t) = \exp\Big(-\int_S (1 - \exp(t f(z)))\, dP(z)\Big) .$$
This is called Campbell's theorem. The proof is done by writing $f =
f^+ - f^-$, where both $f^+$ and $f^-$ are nonnegative, then approximating
both functions with step functions $f_k^+ = \sum_j a_j^+ 1_{B_j^+}$ and $f_k^- =
\sum_j a_j^- 1_{B_j^-}$. Because for a Poisson process, the random variables $N_{B_j^\pm}$
are independent for different j or different sign, the moment generating
function of $\Sigma_f$ is the product of the moment generating functions of the
random variables $a_j^\pm N_{B_j^\pm}$.

The next theorem of Alfréd Rényi (1921-1970) gives a handy tool to check
whether a point process, a random variable $\Pi$ with values in the set of
finite subsets of S, defines a Poisson process.


Definition. A k-cube in an open subset S of $\mathbb{R}^d$ is a set
$$\prod_{i=1}^d \Big[\frac{n_i}{2^k}, \frac{n_i + 1}{2^k}\Big) .$$

Theorem 5.6.3 (Rényi's theorem, 1967). Let P be a non-atomic probability
measure on $(S, \mathcal{B})$ and let $\Pi$ be a point process on $(\Omega, \mathcal{F}, Q)$. Assume for any
finite union of k-cubes $B \subset S$, $Q[N_B = 0] = \exp(-P[B])$. Then $(S, P, \Pi, N)$
is a Poisson process with mean measure P.


Proof. (i) Define $O(B) = \{\omega \in \Omega \mid N_B(\omega) = 0\} \subset \Omega$ for any measurable
set B in S. By assumption, $Q[O(B)] = \exp(-P[B])$.


(ii) For m disjoint k-cubes $\{B_j\}_{j=1}^m$, the sets $O(B_j) \subset \Omega$ are independent.
Proof:
$$Q\Big[\bigcap_{j=1}^m O(B_j)\Big] = Q\Big[\{N_{\bigcup_{j=1}^m B_j} = 0\}\Big] = \exp\Big(-P\Big[\bigcup_{j=1}^m B_j\Big]\Big) = \prod_{j=1}^m \exp(-P[B_j]) = \prod_{j=1}^m Q[O(B_j)] .$$


(iii) We count the number of points in an open subset U of S using
k-cubes: define for k > 0 the random variable $N_U^k(\omega)$ as the number of k-cubes
B for which $\omega \notin O(B \cap U)$. These random variables $N_U^k(\omega)$ converge
to $N_U(\omega)$ for $k \to \infty$, for almost all $\omega$.


(iv) For an open set U, the random variable $N_U$ is Poisson distributed
with parameter P[U]. Proof: we compute its moment generating function.
Because for different k-cubes, the sets $O(B_j) \supset O(U)$ are independent,
the moment generating function of $N_U^k = \sum_j 1_{O(B_j)^c}$ is the product of the
moment generating functions of $1_{O(B_j)^c}$:
$$E[e^{t N_U^k}] = \prod_{k\text{-cube } B} (Q[O(B)] + e^t (1 - Q[O(B)])) = \prod_{k\text{-cube } B} (\exp(-P[B]) + e^t (1 - \exp(-P[B]))) .$$
Each factor of this product is positive and the monotone convergence theorem
shows that the moment generating function of $N_U$ is
$$E[e^{t N_U}] = \lim_{k \to \infty} \prod_{k\text{-cube } B} (\exp(-P[B]) + e^t (1 - \exp(-P[B]))) ,$$
which converges to $\exp(-P[U](1 - e^t))$ for $k \to \infty$ if the measure P is non-atomic.


Because the generating function determines the distribution of $N_U$, this
assures that the random variables $N_U$ are Poisson distributed with parameter
P[U].
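
The limit in step (iv) can be watched numerically in the simplest setting: U = [0, 1) with P equal to $\lambda$ times Lebesgue measure, so that each of the $2^k$ dyadic cells has mass $\lambda / 2^k$. The sketch below evaluates the finite product and its distance to the Poisson moment generating function $\exp(-\lambda(1 - e^t))$. The one-dimensional setting, the parameter values and the function name are assumptions for illustration.

```python
import math

def cube_product(lam, t, k):
    """Product over the 2^k dyadic cells of [0, 1), each of mass lam / 2^k."""
    cells = 2**k
    p = lam / cells
    factor = math.exp(-p) + math.exp(t) * (1 - math.exp(-p))
    return factor**cells

lam, t = 2.0, 0.5
limit = math.exp(-lam * (1 - math.exp(t)))
errors = [abs(cube_product(lam, t, k) - limit) for k in (2, 6, 10, 14)]
print(errors)  # decreasing towards 0
```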


(v) For any disjoint open sets $U_1, \dots, U_m$, the random variables $\{N_{U_j}\}_{j=1}^m$
are independent. Proof: the random variables $\{N_{U_j}^k\}_{j=1}^m$ are independent
for fixed k, because no k-cube can be in more than one of the sets
$U_j$. Letting $k \to \infty$ shows that the variables $N_{U_j}$ are independent.


(vi) To extend (iv) and (v) from open sets to arbitrary Borel sets, one
can use the characterization of a Poisson process by its moment generating
function of $f \in L^1(S, P)$. If $f = \sum_j a_j 1_{U_j}$ for disjoint open sets $U_j$ and
real numbers $a_j$, we have seen that the characteristic functional is the
characteristic functional of a Poisson process. For general $f \in L^1(S, P)$ the
characteristic functional is the one of a Poisson process by approximation
and the Lebesgue dominated convergence theorem. Use $f = 1_B$ to verify
that $N_B$ is Poisson distributed and $f = \sum_j a_j 1_{B_j}$ with disjoint Borel sets
$B_j$ to see that $\{N_{B_j}\}_{j=1}^m$ are independent. □


5.7 Random maps 


Definition. Let (Q,A,P) be a probability space and M be a manifold with 
Borel o-algebra B. A random diffeomorphism on M is a measurable map 
from M x Q— M so that t+ f(x,w) is a diffeomorphism for all w € 2. 
Given a P measure preserving transformation T on Q, it defines a cocycle 


S(2,w) = (F(,w), Tw) 
which is a map on M x 2. 
Example. If $M$ is the circle and $f(x,c) = x + c \sin(x)$ is a circle diffeomorphism,
we can iterate this map and assume that the parameter $c$ is given by
IID random variables which change in each iteration. We can model this
by taking $(\Omega, \mathcal{A}, P) = ([0,1]^{\mathbb{N}}, \mathcal{B}^{\mathbb{N}}, \nu^{\mathbb{N}})$ where $\nu$ is a measure on $[0,1]$,
taking the shift $T(x)_n = x_{n+1}$ and defining

$$ S(x,\omega) = (f(x,\omega_0), T(\omega)) . $$

Iterating this random circle map is done by taking IID random variables
$c_n$ with law $\nu$ and then iterating

$$ x_0, \quad x_1 = f(x_0, c_0), \quad x_2 = f(x_1, c_1), \quad \dots . $$

Example. Let $(\Omega, \mathcal{A}, P, T)$ be an ergodic dynamical system and let $A: \Omega \to SL(d,\mathbb{R})$
be a measurable map with values in the special linear group $SL(d,\mathbb{R})$
of all $d \times d$ matrices with determinant 1. With $M = \mathbb{R}^d$, the random
diffeomorphism $f(x,v) = A(x)v$ is called a matrix cocycle. One often uses
the notation

$$ A^n(x) = A(T^{n-1}(x)) \cdot A(T^{n-2}(x)) \cdots A(T(x)) \cdot A(x) $$

for the $n$'th iterate of this random map.
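
The iteration in the first example above can be sketched in a few lines. This is only an illustration under stated assumptions: the law $\nu$ is taken to be the uniform distribution on $[0,1)$, angles are reduced modulo $2\pi$, and the names `f` and `random_orbit` as well as the seed are choices made here, not part of the text.

```python
import math
import random

def f(x: float, c: float) -> float:
    """Circle map x -> x + c*sin(x), reduced mod 2*pi.
    For |c| < 1 the derivative 1 + c*cos(x) is positive, so the map is a diffeomorphism."""
    return (x + c * math.sin(x)) % (2.0 * math.pi)

def random_orbit(x0: float, steps: int, seed: int = 0) -> list[float]:
    """Iterate x_{n+1} = f(x_n, c_n) with IID parameters c_n."""
    rng = random.Random(seed)
    orbit = [x0]
    for _ in range(steps):
        c = rng.random()  # assumption: c_n is uniform on [0, 1)
        orbit.append(f(orbit[-1], c))
    return orbit

orbit = random_orbit(1.0, 10)
print(orbit)
```

Fixing the seed corresponds to fixing one point $\omega$ of the probability space; a different seed gives a different realization of the same random map.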


Example. Let $M$ be a finite set $\{1, \dots, n\}$ and let $P = (P_{ij})$ be a Markov transition
matrix, that is a matrix with entries $P_{ij} \geq 0$ for which the sum of the column
elements is 1 in each column. A random map for which $f(x_i,\omega) = x_j$ with
probability $P_{ij}$ is called a finite Markov chain.


Random diffeomorphisms are examples of Markov chains as covered in Sec- 
tion (3.14) of the chapter on discrete stochastic processes: 




Lemma 5.7.1. a) Any random map defines transition probability functions
$P : M \times \mathcal{B} \to [0,1]$:

$$ P(x,B) = P[f(x,\omega) \in B] . $$

b) If $\mathcal{A}_n$ is a filtration of $\sigma$-algebras and $X_n(\omega) = T^n(\omega)$ is $\mathcal{A}_n$ adapted,
then $P$ is a discrete Markov process.




Proof. a) We have to check that for all $x$, the measure $P(x,\cdot)$ is a probability
measure on $M$. This is easily done by checking all the axioms.
We further have to verify that for all $B \in \mathcal{B}$, the map $x \mapsto P(x,B)$ is
$\mathcal{B}$-measurable. This is the case because $f$ is a diffeomorphism and so
continuous and especially measurable.

b) is the definition of a discrete Markov process. □


Example. If $\Omega = (\Delta^{\mathbb{N}}, \mathcal{F}^{\mathbb{N}}, \nu^{\mathbb{N}})$ and $T$ is the shift, then the random map
defines a discrete Markov process.


Definition. In this case, we get IID $\Delta$-valued random variables $X_n = T^n(x)_0$.
A random map $f(x,\omega)$ then defines IID diffeomorphism-valued random
variables $f_1(x)(\omega) = f(x, X_1(\omega))$, $f_2(x)(\omega) = f(x, X_2(\omega))$, and so on. We call a random
diffeomorphism in this case an IID random diffeomorphism. If the
transition probability measures are continuous, then the random diffeomorphism
is called a continuous IID random diffeomorphism. If $f(x,\omega)$ depends
smoothly on $\omega$ and the transition probability measures are smooth, then
the random diffeomorphism is called a smooth IID random diffeomorphism.
It is important to note that "continuous" and "smooth" in this definition refer
only to the transition probabilities, and that $\Delta$ must have at least
dimension $d \geq 1$. With respect to $M$, we have already assumed smoothness
from the beginning.


Definition. A measure $\mu$ on $M$ is called a stationary measure for the random
diffeomorphism if the measure $\mu \times P$ is invariant under the map $S$.


Remark. If the random diffeomorphism defines a Markov process, the
stationary measure $\mu$ is a stationary measure of the Markov process.


Example. If every diffeomorphism $x \mapsto f(x,\omega)$ with $\omega \in \Omega$ preserves a
measure $\mu$, then $\mu$ is automatically a stationary measure.


Example. Let $M = T^2 = \mathbb{R}^2/\mathbb{Z}^2$ denote the two-dimensional torus. It is a
group with addition modulo 1 in each coordinate. Consider the IID random
map

$$ f_n(x) = \begin{cases} x + \alpha & \text{with probability } 1/2 \\ x + \beta & \text{with probability } 1/2 \end{cases} $$

Each map either rotates the point by the vector $\alpha = (\alpha_1, \alpha_2)$ or by the
vector $\beta = (\beta_1, \beta_2)$. The Lebesgue measure on $T^2$ is invariant because
it is invariant for each of the two transformations. If $\alpha$ and $\beta$ are both
rational vectors, then there are infinitely many ergodic invariant measures.
For example, if $\alpha = (3/7, 2/7)$, $\beta = (1/11, 5/11)$ then the 77 rectangles
$[i/7, (i+1)/7] \times [j/11, (j+1)/11]$ are permuted by both transformations.
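
One way to see why rational rotation vectors lead to many invariant measures is to look at orbits of single points. The sketch below is an illustration only, with hypothetical helper names `rotate` and `orbit`. It uses exact rational arithmetic to collect all points reachable from a starting point under the two rotations of the example. Since $\alpha$ has order 7 and $\beta$ has order 11 in the torus group, every orbit is finite, and the normalized counting measure on an orbit is preserved by both rotations.

```python
from fractions import Fraction

ALPHA = (Fraction(3, 7), Fraction(2, 7))
BETA = (Fraction(1, 11), Fraction(5, 11))

def rotate(point, vector):
    """Translate a point of the torus R^2/Z^2 by a vector, coordinates mod 1."""
    return tuple((p + v) % 1 for p, v in zip(point, vector))

def orbit(start):
    """All points reachable from start by repeatedly applying the two rotations."""
    seen = {start}
    frontier = [start]
    while frontier:
        point = frontier.pop()
        for vector in (ALPHA, BETA):
            image = rotate(point, vector)
            if image not in seen:
                seen.add(image)
                frontier.append(image)
    return seen

start = (Fraction(1, 3), Fraction(1, 5))
print(len(orbit(start)))  # 77 = 7 * 11 points
```

Different starting points can have disjoint orbits, and each orbit carries its own invariant measure, which is consistent with the claim that there are infinitely many of them.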


Definition. A stationary measure $\mu$ of a random diffeomorphism is called
ergodic, if $\mu \times P$ is an ergodic invariant measure for the map $S$ on
$(M \times \Omega, \mu \times P)$.


Remark. If $\mu$ is a stationary invariant measure, one has

$$ \mu(A) = \int_M P(x,A) \, d\mu(x) $$

for every Borel set $A \in \mathcal{A}$. We have earlier written this as a fixed point
equation for the Markov operator $\mathcal{P}$ acting on measures: $\mathcal{P}\mu = \mu$. In the
context of random maps, the Markov operator is also called a transfer
operator.


Remark. Ergodicity especially means that the transformation $T$ on the
"base probability space" $(\Omega, \mathcal{A}, P)$ is ergodic.


Definition. The support of a measure $\mu$ is the complement of the open set
of points $x$ for which there is a neighborhood $U$ with $\mu(U) = 0$. It is by
definition a closed set.


The previous example 2) shows that there can be infinitely many ergodic invariant
measures of a random diffeomorphism. But for smooth IID random
diffeomorphisms, one has only finitely many, if the manifold is compact:

Theorem 5.7.2 (Finitely many ergodic stationary measures (Doob)). If $M$
is compact, a smooth IID random diffeomorphism has finitely many ergodic
stationary measures $\mu_i$. Their supports are mutually disjoint and separated
by open sets.


Proof. (i) Let $\mu_1$ and $\mu_2$ be two ergodic invariant measures. Denote by $\Sigma_1$
and $\Sigma_2$ their supports. Assume $\Sigma_1$ and $\Sigma_2$ are not disjoint. Then there
exist points $x_i \in \Sigma_i$ and open neighborhoods $U_i$ of $x_i$ so that the transition probability
$P(x_1, U_2)$ is positive. This uses the assumption that the transition probabilities
have smooth densities. But then $\mu_2(U \times \Omega) = 0$ and $\mu_2(S(U \times \Omega)) > 0$,
violating the measure preserving property of $S$.


(ii) If there are infinitely many ergodic invariant measures, then there
exist at least countably many. We can enumerate them as $\mu_1, \mu_2, \dots$. Denote
by $\Sigma_i$ their supports. Choose a point $y_i$ in $\Sigma_i$. The sequence of points
has an accumulation point $y \in M$ by compactness of $M$. This implies
that an arbitrary $\epsilon$-neighborhood $U$ of $y$ intersects infinitely many $\Sigma_i$.
Again, the smoothness assumption on the transition probabilities $P(y, \cdot)$
contradicts the $S$ invariance of the measures $\mu_i$ having supports $\Sigma_i$.

□


Remark. If $\mu_1, \mu_2$ are stationary probability measures, then $\lambda \mu_1 + (1-\lambda)\mu_2$
is another stationary probability measure. This theorem implies that the
set of stationary probability measures forms a closed convex simplex with
finitely many corners. It is an example of a Choquet simplex.


5.8 Circular random variables 


Definition. A measurable function from a probability space $(\Omega, \mathcal{A}, P)$ to
the circle $(T, \mathcal{B})$ with Borel $\sigma$-algebra $\mathcal{B}$ is called a circle-valued random
variable. It is an example of a directional random variable. We can realize
the circle as $T = [-\pi, \pi)$ or $T = [0, 2\pi) = \mathbb{R}/(2\pi\mathbb{Z})$.


Example. If $(\Omega, \mathcal{A}, P) = (\mathbb{R}, \mathcal{A}, e^{-x^2/2}/\sqrt{2\pi} \, dx)$, then $X(x) = x \bmod 2\pi$ is a
circle-valued random variable. In general, for any real-valued random variable
$Y$, the random variable $X(x) = Y(x) \bmod 2\pi$ is a circle-valued random
variable.


Example. For a positive integer $k$, the first significant digit is $X(k) =
2\pi \log_{10}(k) \bmod 1$. It is a circle-valued random variable on every finite
probability space $(\Omega = \{1, \dots, n\}, \mathcal{A}, P[\{k\}] = 1/n)$.

Example. A die takes values in $0,1,2,3,4,5$ (count $6 = 0$). We roll it two
times, but instead of adding up the results $X$ and $Y$, we add them up
modulo 6. For example, if $X = 4$ and $Y = 3$, then $X + Y = 1$. Note that
in general $E[X + Y] \neq E[X] + E[Y]$. Even if $X$ is an unfair die and if $Y$ is
fair, then $X + Y$ is a fair die.
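
The last claim of the dice example can be verified exactly. The following sketch assumes that the two rolls are independent and uses an arbitrarily chosen unfair law for $X$; the function name `law_of_sum_mod` is a hypothetical helper.

```python
from fractions import Fraction

def law_of_sum_mod(px, py, modulus=6):
    """Law of (X + Y) mod modulus for independent X, Y with laws px, py."""
    law = [Fraction(0)] * modulus
    for x, p in enumerate(px):
        for y, q in enumerate(py):
            law[(x + y) % modulus] += p * q
    return law

# An arbitrarily chosen unfair die and a fair die.
unfair = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8),
          Fraction(1, 16), Fraction(1, 32), Fraction(1, 32)]
fair = [Fraction(1, 6)] * 6

print(law_of_sum_mod(unfair, fair))  # every face has probability 1/6
```

Replacing `fair` by a second unfair law shows that the sum is in general not fair when neither die is.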


Definition. The law of a circular random variable $X$ is the push-forward
measure $\mu = X^*P$ on the circle $T$. If the law is absolutely continuous, it
has a probability density function $f_X$ on the circle and $\mu = f_X(x) \, dx$. As
on the real line the Lebesgue decomposition theorem (2.12.2) assures that
every measure on the circle can be decomposed as $\mu = \mu_{pp} + \mu_{ac} + \mu_{sc}$, where
$\mu_{pp}$ is (pp), $\mu_{sc}$ is (sc) and $\mu_{ac}$ is (ac).


Example. The law of the wrapped normal distribution in the first example
is a measure on the circle with a smooth density

$$ f_X(x) = \sum_{k=-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-(x + 2\pi k)^2/2} . $$

It is an example of a wrapped normal distribution.


Example. The law of the first significant digit random variable $X_n(k) =
2\pi \log_{10}(k) \bmod 1$ defined on $\{1, \dots, n\}$ is a discrete measure, supported
on $\{k 2\pi/10 \mid 0 \leq k < 10\}$. It is an example of a lattice distribution.


Definition. The entropy of a circle-valued random variable $X$ with probability
density function $f_X$ is defined as $H(f) = -\int_0^{2\pi} f(x) \log(f(x)) \, dx$.
The relative entropy for two densities is defined as

$$ H(f|g) = \int_0^{2\pi} f(x) \log(f(x)/g(x)) \, dx . $$

The Gibbs inequality lemma (2.15.1) assures that $H(f|g) \geq 0$ and that
$H(f|g) = 0$, if $f = g$ almost everywhere.


Definition. The mean direction $m$ and resultant length $\rho$ of a circular
random variable taking values in $\{|z| = 1\} \subset \mathbb{C}$ are defined as

$$ \rho e^{im} = E[e^{iX}] . $$

One can write $\rho = E[\cos(X - m)]$. The circular variance is defined as
$V = 1 - \rho = E[1 - \cos(X - m)] = E[(X - m)^2/2 - (X - m)^4/4! + \dots]$.
The latter expansion shows the relation with the variance in the case of
real-valued random variables. The circular variance is a number in $[0, 1]$. If
$\rho = 0$, there is no distinguished mean direction. We define $m = 0$ just to
have one in that case.
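
For a finite sample of angles, the same quantities can be computed from the empirical mean of $e^{iX}$. The following is a minimal sketch; the function name `circular_summary` and the cutoff $10^{-12}$ used to decide that $\rho$ is numerically zero are assumptions made for this illustration.

```python
import cmath
import math

def circular_summary(angles):
    """Return (m, rho, V) defined by rho * exp(i m) = mean of exp(i x), V = 1 - rho."""
    z = sum(cmath.exp(1j * x) for x in angles) / len(angles)
    rho = abs(z)
    # Convention from the text: m = 0 if there is no distinguished direction.
    m = cmath.phase(z) if rho > 1e-12 else 0.0
    return m, rho, 1.0 - rho

# Concentrated sample: mean direction 0.2 and small circular variance.
print(circular_summary([0.1, 0.2, 0.3]))

# Eight equally spaced angles: rho is 0 up to rounding, so V is 1.
print(circular_summary([2 * math.pi * k / 8 for k in range(8)]))
```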




Example. If the distribution of $X$ is located at a single point $x_0$, then $\rho =
1$, $m = x_0$ and $V = 0$. If the distribution of $X$ is the uniform distribution
on the circle, then $\rho = 0$, $V = 1$. There is no particular mean direction in
this case. For the wrapped normal distribution $m = 0$, $\rho = e^{-\sigma^2/2}$, $V =
1 - e^{-\sigma^2/2}$.


The following theorem is analogous to theorem (2.5.5):


Theorem 5.8.1 (Chebychev inequality on the circle). If $X$ is a circular
random variable with circular mean $m$ and variance $V$, then

$$ P[|\sin((X - m)/2)| \geq \epsilon] \leq \frac{V}{2\epsilon^2} . $$


Proof. We can assume without loss of generality that $m = 0$, otherwise
replace $X$ with $X - m$ which does not change the variance. We take $T =
[-\pi, \pi)$. We use the trigonometric identity $1 - \cos(x) = 2\sin^2(x/2)$, to get

$$ V = E[1 - \cos(X)] = 2E[\sin^2(X/2)] \geq 2E[1_{|\sin(X/2)| \geq \epsilon} \sin^2(X/2)] \geq 2\epsilon^2 P[|\sin(X/2)| \geq \epsilon] . $$

□


Example. Let $X$ be the random variable which has a discrete distribution
with a law supported on the points $x = x_0 = 0$ and $x = x_\pm =
\pm 2\arcsin(\epsilon)$ and $P[X = x_0] = 1 - V/(2\epsilon^2)$ and $P[X = x_\pm] = V/(4\epsilon^2)$. This
distribution has the circular mean $m = 0$ and the variance $V$. The equality

$$ P[|\sin(X/2)| \geq \epsilon] = 2V/(4\epsilon^2) = V/(2\epsilon^2) $$

shows that the Chebychev inequality on the circle is "sharp": one can not
improve it without further assumptions on the distribution.
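
The sharpness example can be checked numerically. The sketch below assumes $V \leq 2\epsilon^2$ so that the weights are probabilities, and picks $V = 0.1$, $\epsilon = 0.5$ for illustration; the helper names are hypothetical, and a small tolerance is used in the tail event to absorb floating point rounding.

```python
import math

def extremal_law(V, eps):
    """Three-point law: 0 with weight 1 - V/(2 eps^2), +-2 arcsin(eps) with weight V/(4 eps^2) each."""
    a = 2.0 * math.asin(eps)
    w = V / (4.0 * eps ** 2)
    return [(0.0, 1.0 - 2.0 * w), (a, w), (-a, w)]

def circular_variance(law):
    """V = 1 - |E[exp(iX)]| for a discrete law given as (point, weight) pairs."""
    re = sum(p * math.cos(x) for x, p in law)
    im = sum(p * math.sin(x) for x, p in law)
    return 1.0 - math.hypot(re, im)

def tail(law, eps):
    """P[|sin(X/2)| >= eps], with a tiny tolerance for rounding."""
    return sum(p for x, p in law if abs(math.sin(x / 2.0)) >= eps - 1e-12)

V, eps = 0.1, 0.5
law = extremal_law(V, eps)
print(circular_variance(law), tail(law, eps), V / (2 * eps ** 2))
```

For these values the circular variance is $0.1$ and both the tail probability and the Chebychev bound are $0.2$, up to rounding.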


Definition. A sequence of circle-valued random variables $X_n$ converges
weakly to a circle-valued random variable $X$ if the law of $X_n$ converges
weakly to the law of $X$. As with real valued random variables, weak convergence
is also called convergence in law.


Example. The sequence $X_n$ of significant digit random variables converges
weakly to a random variable with lattice distribution $P[X = k] =
\log_{10}(k + 1) - \log_{10}(k)$ supported on $\{k 2\pi/10 \mid 0 \leq k < 10\}$. It is called
the distribution of the first significant digit. The interpretation is that if
you take a large random number, then the probability that the first digit
is 1 is $\log_{10}(2)$, the probability that the first digit is 6 is $\log_{10}(7/6)$. The law is
also called Benford's law.

Definition. The characteristic function of a circle-valued random variable
$X$ is the Fourier transform $\phi_X = \hat{\mu}_X$ of the law $\mu_X$ of $X$. It is a sequence (that
is a function on $\mathbb{Z}$) given by

$$ \phi_X(n) = E[e^{inX}] = \int_T e^{inx} \, d\mu_X(x) . $$


Definition. More generally, the characteristic function of a $T^d$-valued random
variable (circle-valued random vector) is the Fourier transform of the
law of $X$. It is a function on $\mathbb{Z}^d$ given by

$$ \phi_X(n) = E[e^{in \cdot X}] = \int_{T^d} e^{in \cdot x} \, d\mu_X(x) . $$


The following lemma is analogous to corollary (2.17).


Lemma 5.8.2. A sequence $X_n$ of circle-valued random variables converges
in law to a circle-valued random variable $X$ if and only if for every integer
$k$, one has $\phi_{X_n}(k) \to \phi_X(k)$ for $n \to \infty$.


Example. A circle-valued random variable with probability density function
$f(x) = C e^{\kappa \cos(x - \alpha)}$ is called the Mises distribution. It is also called the
circular normal distribution. The constant $C$ is $1/(2\pi I_0(\kappa))$, where $I_0(\kappa) =
\sum_{n=0}^{\infty} (\kappa/2)^{2n}/(n!)^2$ is a modified Bessel function. The parameter $\kappa$ is called
the concentration parameter, the parameter $\alpha$ is called the mean direction.
For $\kappa \to 0$, the Mises distribution approaches the uniform distribution on
the circle.
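
The normalization constant can be checked directly from the series for $I_0$. The sketch below truncates the series at 60 terms and integrates with a midpoint rule; both choices, as well as the helper names, are assumptions made for this illustration.

```python
import math

def bessel_i0(kappa: float, terms: int = 60) -> float:
    """Truncated series I_0(kappa) = sum_n (kappa/2)^(2n) / (n!)^2."""
    return sum((kappa / 2.0) ** (2 * n) / math.factorial(n) ** 2 for n in range(terms))

def mises_density(x: float, kappa: float, alpha: float = 0.0) -> float:
    """Density C * exp(kappa * cos(x - alpha)) with C = 1 / (2 pi I_0(kappa))."""
    return math.exp(kappa * math.cos(x - alpha)) / (2.0 * math.pi * bessel_i0(kappa))

def integrate_over_circle(func, steps: int = 2000) -> float:
    """Midpoint rule on [-pi, pi)."""
    h = 2.0 * math.pi / steps
    return h * sum(func(-math.pi + (j + 0.5) * h) for j in range(steps))

# Total mass of the density should be 1.
print(integrate_over_circle(lambda x: mises_density(x, 2.0, 0.5)))

# For small kappa the density is close to the uniform density 1/(2 pi).
print(mises_density(1.0, 1e-9), 1.0 / (2.0 * math.pi))
```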


Figure. The density function of
the Mises distribution plotted as a
polar graph.


Figure. The density function of
the Mises distribution on $[-\pi, \pi]$.

Proposition 5.8.3. The Mises distribution maximizes the entropy among all
circular distributions with fixed mean $\alpha$ and circular variance $V$.


Proof. If $g$ is the density of the Mises distribution, then $\log(g) = \kappa \cos(x -
\alpha) + \log(C)$ and $H(g) = -\kappa \rho - \log(C)$.
Now compute the relative entropy

$$ 0 \leq H(f|g) = \int f(x) \log(f(x)) \, dx - \int f(x) \log(g(x)) \, dx . $$

Since the first integral is $-H(f)$, this means with the resultant length $\rho$ of $f$ and $g$:

$$ H(f) \leq -E_f[\kappa \cos(x - \alpha) + \log(C)] = -\kappa \rho - \log(C) = H(g) . $$

□


Definition. A circle-valued random variable with probability density function

$$ f(x) = \sum_{k=-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \alpha - 2k\pi)^2}{2\sigma^2}} $$

is the wrapped normal distribution. It is obtained by taking the normal
distribution and wrapping it around the circle: if $X$ has a normal distribution
with mean $\alpha$ and variance $\sigma^2$, then $X \bmod 2\pi$ has the wrapped normal
distribution with those parameters.
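
The Fourier coefficients of this density can be computed numerically and compared with $e^{-k^2\sigma^2/2}$, the value at the integer $k$ of the characteristic function of the underlying normal distribution when $\alpha = 0$. The sketch below is an illustration under assumptions: the sum over $k$ is truncated at 20 wraps on each side, the integral uses a midpoint rule, and the helper names are hypothetical.

```python
import math

def wrapped_normal_density(x, sigma, alpha=0.0, wraps=20):
    """Truncated sum over k of normal densities shifted by 2*pi*k."""
    norm = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    return sum(
        norm * math.exp(-((x - alpha - 2.0 * math.pi * k) ** 2) / (2.0 * sigma ** 2))
        for k in range(-wraps, wraps + 1)
    )

def fourier_coefficient(k, sigma, steps=4000):
    """Midpoint rule for the integral of f(x) cos(kx) over [0, 2 pi) with alpha = 0.
    The sine part vanishes because the density is symmetric."""
    h = 2.0 * math.pi / steps
    return h * sum(
        wrapped_normal_density((j + 0.5) * h, sigma) * math.cos(k * (j + 0.5) * h)
        for j in range(steps)
    )

sigma = 0.8
for k in range(3):
    print(k, fourier_coefficient(k, sigma), math.exp(-k * k * sigma ** 2 / 2.0))
```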


Example. A circle-valued random variable with constant density is called 
a random variable with the uniform distribution. 


Example. A circle-valued random variable with values in a closed finite
subgroup $H$ of the circle is said to have a lattice distribution. For example, the
random variable which takes the value $0$ with probability $1/2$, the value
$2\pi/3$ with probability $1/4$ and the value $4\pi/3$ with probability $1/4$ is an
example of a lattice distribution. The group $H$ is the finite cyclic group $\mathbb{Z}_3$.


Remark. Why do we bother with new terminology and not just look at real-valued
random variables taking values in $[0, 2\pi)$? The reason to change the
language is that there is a natural addition of angles given by rotations.
Also, any modeling by vector-valued random variables is somewhat arbitrary.
An advantage is also that the characteristic function is now a sequence and
no longer a function on the real line.


Distribution      Parameter                 Characteristic function $\phi_X(k)$
uniform                                     $\phi_X(k) = 0$ for $k \neq 0$ and $\phi_X(0) = 1$
Mises             $\kappa$, $\alpha = 0$    $I_k(\kappa)/I_0(\kappa)$
wrapped normal    $\sigma$, $\alpha = 0$    $e^{-k^2\sigma^2/2} = \rho^{k^2}$

The functions $I_k(\kappa)$ are modified Bessel functions of the first kind of $k$'th
order.


Definition. If $X_1, X_2, \dots$ is a sequence of circle-valued random variables,
define $S_n = X_1 + \cdots + X_n$.


Theorem 5.8.4 (Central limit theorem for circle-valued random variables).
The sum $S_n$ of IID circle-valued random variables $X_i$ which do
not have a lattice distribution converges in distribution to the uniform
distribution.


Proof. We have $|\phi_X(k)| < 1$ for all $k \neq 0$ because if $|\phi_X(k)| = 1$ for some
$k \neq 0$, then $X$ has a lattice distribution. Because $\phi_{S_n}(k) = \prod_{j=1}^n \phi_{X_j}(k)$,
all Fourier coefficients $\phi_{S_n}(k)$ converge to $0$ for $n \to \infty$ for $k \neq 0$. □
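
The mechanism of this proof can be illustrated with a simple non-lattice example chosen here for convenience: let $X$ take the angles $0$ and $1$ (one radian) with probability $1/2$ each. Then $\phi_X(k) = (1 + e^{ik})/2$ has modulus $|\cos(k/2)| < 1$ for every integer $k \neq 0$, and the Fourier coefficients of $S_n$ are the $n$'th powers. The names `phi_X` and `phi_S` below are hypothetical.

```python
import cmath

def phi_X(k: int) -> complex:
    """Characteristic function of X taking the angles 0 and 1 radian with probability 1/2 each."""
    return 0.5 * (1.0 + cmath.exp(1j * k))

def phi_S(k: int, n: int) -> complex:
    """Characteristic function of S_n = X_1 + ... + X_n for IID copies of X."""
    return phi_X(k) ** n

# The coefficient at k = 0 stays 1, all others decay geometrically in n.
for n in (1, 10, 50, 200):
    print(n, [abs(phi_S(k, n)) for k in range(0, 4)])
```

The decay is fast for most $k$ but slow when $k/2$ is close to a multiple of $\pi$, which shows that the convergence need not be uniform in $k$.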


Remark. The IID property can be weakened. The Fourier coefficients

$$ \phi_{X_n}(k) = 1 - a_{nk} $$

should have the property that $\sum_{n=1}^{\infty} a_{nk}$ diverges, for all $k$, because then
$\prod_{n=1}^{\infty} (1 - a_{nk}) \to 0$. If $X_i$ converges in law to a lattice distribution, then
there is a subsequence, for which the central limit theorem does not hold.


Remark. Every Fourier mode goes to zero exponentially. If $|\phi_X(k)| \leq 1 - \delta$
for $\delta > 0$ and all $k \neq 0$, then the convergence in the central limit theorem
is exponentially fast.


Remark. Naturally, the usual central limit theorem still applies if one considers
a circle-valued random variable as a random variable taking values in
$[-\pi, \pi]$. Because the classical central limit theorem shows that $\sum_{i=1}^n X_i/\sqrt{n}$
converges weakly to a normal distribution, $\sum_{i=1}^n X_i/\sqrt{n} \bmod 1$ converges
to the wrapped normal distribution. Note that such a restatement of the
central limit theorem is not natural in the context of circular random variables
because it assumes the circle to be embedded in a particular way in
the real line and also because the operation of dividing by $\sqrt{n}$ is not natural
on the circle. It uses the field structure of the cover $\mathbb{R}$.


Example. Circle-valued random variables appear as magnetic fields in mathematical
physics. Assume the plane is partitioned into squares $[j, j+1) \times
[k, k+1)$ called plaquettes. We can attach IID random variables $B_{jk} = e^{iX_{jk}}$
on each plaquette. The total magnetic field in a region $G$ is the product of
all the magnetic fields $B_{jk}$ in the region:

$$ \prod_{(j,k) \in G} B_{jk} = e^{i \sum_{(j,k) \in G} X_{jk}} . $$

The central limit theorem assures that the total magnetic field distribution
in a large region is close to a uniform distribution.

Example. Consider standard Brownian motion $B_t$ on the real line and its
graph $\{(t, B_t) \mid t \in \mathbb{R}\}$ in the plane. The circle-valued random variable
$X_n = B_n \bmod 1$ gives the distance of the graph at time $t = n$ to the
next lattice point below the graph. The distribution of $X_n$ is the wrapped
normal distribution with parameters $m = 0$ and $\sigma = \sqrt{n}$.


Figure. The graph of one-dimensional
Brownian motion
with a grid. The stochastic process
produces a circle-valued random
variable $X_n = B_n \bmod 1$.


If $X, Y$ are real-valued IID random variables, then $X + Y$ is not independent
of $Y$. Indeed $X + Y$ and $Y$ are positively correlated because

$$ \mathrm{Cov}[X + Y, Y] = \mathrm{Cov}[X, Y] + \mathrm{Cov}[Y, Y] = \mathrm{Cov}[Y, Y] = \mathrm{Var}[Y] > 0 . $$

The situation changes for circle-valued random variables. The sum of two
independent random variables can be independent of the first random variable.
Adding a random variable with uniform distribution immediately renders
the sum uniform:


Theorem 5.8.5 (Stability of the uniform distribution). Let $X, Y$ be circle-valued
random variables. If $Y$ has the uniform distribution and
$X, Y$ are independent, then $X + Y$ is independent of $X$ and has the
uniform distribution.


Proof. We have to show that the event $A = \{X + Y \in [c,d]\}$ is independent
of the event $B = \{X \in [a,b]\}$. To do so we calculate $P[A \cap B] =
\int_a^b \int_{c-x}^{d-x} f_X(x) f_Y(y) \, dy \, dx$. Because $Y$ has the uniform distribution, we get
after a substitution $u = y + x$,

$$ \int_a^b \int_{c-x}^{d-x} f_X(x) f_Y(y) \, dy \, dx = \int_a^b \int_c^d f_X(x) f_Y(u) \, du \, dx = P[A] P[B] . $$

By looking at the characteristic function $\phi_{X+Y} = \phi_X \phi_Y = \phi_Y$, we see that
$X + Y$ has the uniform distribution. □

The interpretation of this result is that adding a uniform random noise to
any given distribution makes it uniform.
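
A discrete analogue of this statement can be verified exactly. The sketch below replaces the circle by the finite cyclic group $\mathbb{Z}_n$, which is a simplifying assumption made here and not the setting of the theorem, and checks whether the joint law of $(X, X + Y)$ factorizes. The helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def joint_law(px, py):
    """Joint law of (X, (X + Y) mod n) for independent X, Y on Z_n."""
    n = len(px)
    joint = {}
    for x, y in product(range(n), repeat=2):
        key = (x, (x + y) % n)
        joint[key] = joint.get(key, Fraction(0)) + px[x] * py[y]
    return joint

def is_independent(joint, n):
    """Check P[X = x, S = s] = P[X = x] * P[S = s] for all pairs (x, s)."""
    first = [sum(joint.get((x, s), Fraction(0)) for s in range(n)) for x in range(n)]
    second = [sum(joint.get((x, s), Fraction(0)) for x in range(n)) for s in range(n)]
    return all(
        joint.get((x, s), Fraction(0)) == first[x] * second[s]
        for x, s in product(range(n), repeat=2)
    )

px = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6), Fraction(0), Fraction(0)]
uniform = [Fraction(1, 5)] * 5

print(is_independent(joint_law(px, uniform), 5))  # uniform noise: independent
print(is_independent(joint_law(px, px), 5))       # non-uniform noise: dependent
```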


On the $d$-dimensional torus $T^d$, the uniform distribution plays the role of
the normal distribution as the following central limit theorem shows:


Theorem 5.8.6 (Central limit theorem for circular random vectors). The
sum $S_n$ of IID circle-valued random vectors $X$ converges in distribution
to the uniform distribution on a closed subgroup $H$ of $G = T^d$.


Proof. Again $|\phi_X(k)| \leq 1$. Let $\Lambda$ denote the set of $k$ such that $\phi_X(k) = 1$.


(i) $\Lambda$ is a lattice. If $\int e^{i k \cdot X(x)} \, dx = 1$ then $e^{i k \cdot X(x)} = 1$ for all $x$. If $\lambda_1, \lambda_2$ are
in $\Lambda$, then $\lambda_1 + \lambda_2 \in \Lambda$.


(ii) The random variable takes values in a group $H$ which is the dual group
of $\mathbb{Z}^d/\Lambda$.


(iii) Because $\phi_{S_n}(k) = \prod_{j=1}^n \phi_{X_j}(k)$, all Fourier coefficients $\phi_{S_n}(k)$ which
are not 1 converge to 0.


(iv) $\phi_{S_n}(k) \to 1_\Lambda(k)$, which is the characteristic function of the uniform
distribution on $H$. □


Example. If $G = T^2$ and $\Lambda = \{\dots, (-1,0), (1,0), (2,0), \dots\}$, then the random
variable $X$ takes values in $H = \{(0,y) \mid y \in T^1\}$, a one dimensional
circle and there is no smaller subgroup. The limiting distribution is the
uniform distribution on that circle.


Remark. If $X$ is a random variable with an absolutely continuous distribution
on $T^d$, then the distribution of $S_n$ converges to the uniform distribution
on $T^d$.


Exercise. Let $Y$ be a real-valued random variable which has standard
normal distribution. Then $X(x) = Y(x) \bmod 1$ is a circle-valued random
variable. If $Y_i$ are IID normal distributed random variables, then
$S_n = Y_1 + \cdots + Y_n \bmod 1$ are circle-valued random variables. What is
$\mathrm{Cov}[S_n, S_m]$?


The central limit theorem applies to all compact Abelian groups. Here is
the setup:

Definition. A topological group $G$ is a group with a topology so that addition
on this group is a continuous map from $G \times G \to G$ and such that the
inverse $x \mapsto x^{-1}$ from $G$ to $G$ is continuous. If the group acts transitively
as transformations on a space $H$, the space $H$ is called a homogeneous
space. In this case, $H$ can be identified with $G/G_x$, where $G_x$ is the isotropy
subgroup of $G$ consisting of all elements which fix a point $x$.


Example. Any finite group $G$ with the discrete topology $d(x,y) = 1$ if $x \neq y$
and $d(x,y) = 0$ if $x = y$ is a topological group.


Example. The real line $\mathbb{R}$ with addition or more generally, the Euclidean
space $\mathbb{R}^d$ with addition are topological groups when the topology is the
one induced by the usual Euclidean distance.


Example. The circle $T$ with addition or more generally, the torus $T^d$ with
addition is a topological group. It is an example of a compact
Abelian topological group.


Example. The general linear group $G = Gl(n,\mathbb{R})$ with matrix multiplication
is a topological group if the topology is the topology inherited as a subset
of the Euclidean space $\mathbb{R}^{n^2}$ of $n \times n$ matrices. Also subgroups of $Gl(n,\mathbb{R})$,
like the special linear group $SL(n,\mathbb{R})$ of matrices with determinant 1 or
the rotation group $SO(n,\mathbb{R})$ of orthogonal matrices with determinant 1 are topological groups.
The rotation group has the sphere $S^{n-1}$ as a homogeneous space.


Definition. A measurable function from a probability space $(\Omega, \mathcal{A}, P)$ to
a topological group $(G, \mathcal{B})$ with Borel $\sigma$-algebra $\mathcal{B}$ is called a $G$-valued
random variable.


Definition. The law of a $G$-valued random variable $X$ is the push-forward
measure $\mu = X^*P$ on $G$.


Example. If $(G, \mathcal{A}, P)$ is the probability space obtained by taking a compact
topological group $G$ with a group invariant distance $d$, a Borel $\sigma$-algebra $\mathcal{A}$ and
the Haar measure $P$, then $X(x) = x$ is a group-valued random variable.
The law of $X$ is called the uniform distribution on $G$.


Definition. A measurable function from a probability space $(\Omega, \mathcal{A}, P)$ to the
group $(G, \mathcal{B})$ is called a $G$-valued random variable. A measurable function
to a homogeneous space $H$ is called an $H$-valued random variable. Especially,
if $H$ is the $d$-dimensional sphere $(S^d, \mathcal{B})$ with Borel probability measure,
then $X$ is called a spherical random variable. It is used to describe spherical
data.


5.9 Lattice points near Brownian paths 


The following law of large numbers deals with sums $S_n$ of $n$ random variables,
where the law of the random variables depends on $n$.




Theorem 5.9.1 (Law of large numbers for random variables with shrinking
support). If $X_i$ are IID random variables with uniform distribution on $[0,1]$,
then for any $0 \leq \delta < 1$, and $A_n = [0, 1/n^\delta]$, we have

$$ \frac{1}{n^{1-\delta}} \sum_{k=1}^{n} 1_{A_n}(X_k) \to 1 $$

in probability for $n \to \infty$. For $\delta < 1/2$, we have almost everywhere convergence.


Proof. For fixed $n$, the random variables $Z_k(x) = 1_{[0,1/n^\delta]}(X_k)$ are independent,
identically distributed random variables with mean $E[Z_k] = p = 1/n^\delta$
and variance $p(1-p)$. The sum $S_n = \sum_{k=1}^n Z_k$ has a binomial distribution
with mean $np = n^{1-\delta}$ and variance $\mathrm{Var}[S_n] = np(1-p) = n^{1-\delta}(1-p)$.
Note that if $n$ changes, then the random variables in the sum $S_n$ change
too, so that we can not invoke the law of large numbers directly. But the
tools for the proof of the law of large numbers still work.


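For fixed $\epsilon > 0$ and $n$, the set

$$ B_n = \left\{ x \in [0,1] \;\Big|\; \left| \frac{S_n(x)}{n^{1-\delta}} - 1 \right| > \epsilon \right\} $$

has by the Chebychev inequality (2.5.5), the measure

$$ P[B_n] \leq \mathrm{Var}\left[\frac{S_n}{n^{1-\delta}}\right] / \epsilon^2 = \frac{\mathrm{Var}[S_n]}{n^{2-2\delta} \epsilon^2} = \frac{1-p}{\epsilon^2 n^{1-\delta}} \leq \frac{1}{\epsilon^2 n^{1-\delta}} . $$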


This proves convergence in probability and the weak law version for all
$\delta < 1$ follows.


In order to apply the Borel-Cantelli lemma (2.2.2), we need to take a subsequence
so that $\sum_{k=1}^{\infty} P[B_{n_k}]$ converges. Like this, we establish complete
convergence which implies almost everywhere convergence.


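Take $\kappa = 2$ with $\kappa(1 - \delta) > 1$ and define $n_k = k^\kappa = k^2$. The event $B =
\limsup_k B_{n_k}$ has measure zero. This is the event that we are in infinitely
many of the sets $B_{n_k}$. Consequently, for large enough $k$, we are in none of
the sets $B_{n_k}$: if $x \notin B$, then

$$ \left| \frac{S_{n_k}(x)}{n_k^{1-\delta}} - 1 \right| \leq \epsilon $$

for large enough $k$. Therefore,

$$ \left| \frac{S_{n_k + l}(x)}{n_k^{1-\delta}} - 1 \right| \leq \left| \frac{S_{n_k}(x)}{n_k^{1-\delta}} - 1 \right| + \frac{S_l(T^{n_k}(x))}{n_k^{1-\delta}} . $$

Because for $n_k = k^2$ we have $n_{k+1} - n_k = 2k + 1$ and

$$ \frac{S_l(T^{n_k}(x))}{n_k^{1-\delta}} \leq \frac{2k+1}{k^{2(1-\delta)}} . $$

For $\delta < 1/2$, this goes to zero assuring that we have not only convergence
of the sum along a subsequence $S_{n_k}$ but for $S_n$ (compare lemma (2.11.2)).
We know now $|S_n/n^{1-\delta} - 1| \to 0$ almost everywhere for $n \to \infty$. □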
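
The statement of the theorem is easy to explore by simulation. The sketch below draws $n$ uniform random numbers and counts how many fall into $A_n = [0, 1/n^\delta]$; the function name, the seed and the chosen values of $n$ and $\delta$ are assumptions made for illustration.

```python
import random

def shrinking_support_ratio(n: int, delta: float, seed: int = 0) -> float:
    """Return (1 / n^(1 - delta)) * #{k <= n : X_k in [0, n^(-delta)]} for IID uniform X_k."""
    rng = random.Random(seed)
    threshold = n ** (-delta)
    hits = sum(1 for _ in range(n) if rng.random() <= threshold)
    return hits / n ** (1.0 - delta)

# The ratio should be close to 1 for large n; the spread shrinks like n^(-(1 - delta)/2).
for n in (10 ** 3, 10 ** 4, 10 ** 5):
    print(n, shrinking_support_ratio(n, 0.3))
```

Note that, as stressed in the proof, each value of $n$ uses a different target set $A_n$, so the experiment is set up anew for every $n$.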


Remark. If we sum up independent random variables $Z_k = n^\delta 1_{[0,1/n^\delta]}(X_k)$
where $X_k$ are IID random variables, the moments $E[Z_k^m] = n^{(m-1)\delta}$ become
infinite for $m \geq 2$. The laws of large numbers do not apply because
$E[Z_k^2]$ depends on $n$ and diverges for $n \to \infty$. We also change the
random variables, when taking larger sums. For example, the assumption
$\sup_n \frac{1}{n} \sum_{i=1}^n \mathrm{Var}[X_i] < \infty$ does not apply.


Remark. We could not conclude the proof in the same way as in theorem
(2.9.3) because $U_n = \sum_{k=1}^n Z_k$ is not monotonically increasing. For
$\delta \in [1/2, 1)$ we have only proven a weak law of large numbers. It seems
however that a strong law should work for all $\delta < 1$.


Here is an application of this theorem in random geometry. 


Corollary 5.9.2. Assume we place randomly $n$ discs of radius $r = 1/n^{1/2 - \delta/2}$
onto the plane. Their total area without overlap is $\pi n r^2 = \pi n^\delta$. If $S_n$ is the
number of lattice points hit by the discs, then for $\delta < 1/2$

$$ \frac{S_n}{n^\delta} \to \pi $$

almost surely.

Remark. Similarly as with the Buffon needle problem mentioned in the introduction,
we can get a limit. But unlike the Buffon needle problem, where
we keep the setup the same, independent of the number of experiments, we
adapt the experiment depending on the number of tries. If we make a large
number of experiments, we take a small radius of the disc. The case $\delta = 0$
is the trivial case, where the radius of the disc stays the same.


The proof of theorem (5.9.1) shows that the assumption of independence 
can be weakened. It is enough to have asymptotically exponentially decor- 
related random variables. 


Definition. A measure preserving transformation $T$ of $[0,1]$ has decay of
correlations for a random variable $X$ satisfying $E[X] = 0$, if

$$ \mathrm{Cov}[X, X(T^n)] \to 0 $$

for $n \to \infty$. If

$$ \mathrm{Cov}[X, X(T^n)] \leq e^{-Cn} $$

for some constant $C > 0$, then $X$ has exponential decay of correlations.


Lemma 5.9.3. If $B_t$ is standard Brownian motion, then the random variables
$X_n = B_n \bmod 1$ have exponential decay of correlations.


Proof. $B_n$ has the normal distribution with mean $0$ and standard
deviation $\sigma = \sqrt{n}$. The random variable $X_n$ is a circle-valued random variable
with wrapped normal distribution with parameter $\sigma = \sqrt{n}$. Its characteristic
function is $\phi_X(k) = e^{-k^2\sigma^2/2}$. We have $X_{n+m} = X_n + Y_m \bmod 1$,
where $X_n$ and $Y_m$ are independent circle-valued random variables. Let
$g_n(x) = \sum_{k=0}^{\infty} e^{-k^2 n/2} \cos(kx) = 1 - \epsilon(x) \geq 1 - e^{-Cn}$ be the density of $X_n$
which is also the density of $Y_n$. We want to know the correlation between
$X_{n+m}$ and $X_n$:

$$ \int_0^1 \int_0^1 f(x) f(x+y) g(x) g(y) \, dy \, dx . $$

With $u = x + y$, this is equal to

$$ \int_0^1 \int_0^1 f(x) f(u) g(x) g(u - x) \, du \, dx = \int_0^1 \int_0^1 f(x) f(u) (1 - \epsilon(x))(1 - \epsilon(u - x)) \, du \, dx \leq C_1 |f|_\infty^2 e^{-Cn} . $$


Proposition 5.9.4. If $T : [0,1] \to [0,1]$ is a measure-preserving transformation
which has exponential decay of correlations for $X_j$, then for any
$\delta \in [0, 1/2)$, and $A_n = [0, 1/n^\delta]$, we have

$$ \frac{1}{n^{1-\delta}} \sum_{k=1}^{n} 1_{A_n}(T^k(x)) \to 1 $$

for $n \to \infty$.

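Proof. The same proof works. The decorrelation assumption implies that
there exists a constant $C$ such that

$$ \sum_{i \neq j \leq n} \mathrm{Cov}[X_i, X_j] \leq C . $$

Therefore,

$$ \mathrm{Var}[S_n] = n \mathrm{Var}[X_i] + \sum_{i \neq j \leq n} \mathrm{Cov}[X_i, X_j] , \qquad \sum_{i \neq j \leq n} \mathrm{Cov}[X_i, X_j] \leq C_1 |f|_\infty^2 \sum_{i \neq j \leq n} e^{-C|i-j|} . $$

The sum is controlled by the decorrelation assumption and so $\mathrm{Var}[S_n] \leq n \mathrm{Var}[X_i] + C$. □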


Remark. The assumption that the probability space $\Omega$ is the interval $[0,1]$ is
not crucial. Many probability spaces $(\Omega, \mathcal{A}, P)$ where $\Omega$ is a compact metric
space with Borel $\sigma$-algebra $\mathcal{A}$ and $P[\{x\}] = 0$ for all $x \in \Omega$ are measure
theoretically isomorphic to $([0,1], \mathcal{B}, dx)$, where $\mathcal{B}$ is the Borel $\sigma$-algebra
on $[0,1]$ (see [13] proposition (2.17)). The same remark also shows that
the assumption $A_n = [0, 1/n^\delta]$ is not essential. One can take any nested
sequence of sets $A_n \in \mathcal{A}$ with $P[A_n] = 1/n^\delta$, and $A_{n+1} \subset A_n$.


Figure. We can apply this proposition
to a lattice point problem
near the graphs of one-dimensional
Brownian motion,
where we have a probability space
of paths and where we can make
a statement about almost every
path in that space. This is a result
in the geometry of numbers
for connected sets with fractal
boundary.

Corollary 5.9.5. Assume $B_t$ is standard Brownian motion. For any $0 \leq \delta <
1/2$, there exists a constant $C$, such that any $1/n^{1+\delta}$ neighborhood of the
graph of $B$ over $[0,1]$ contains at least $C n^{1-\delta}$ lattice points, if the lattice
has a minimal spacing distance of $1/n$.


Proof. $B_{t+1/n} \bmod 1/n$ is not independent of $B_t$ but the Poincaré return
map $T$ from time $t = k/n$ to time $(k+1)/n$ is a Markov process from
$[0, 1/n]$ to $[0, 1/n]$ with transition probabilities. The random variables $X_i$
have exponential decay of correlations as we have seen in lemma (5.9.3). □


Remark. A similar result can be shown for other dynamical systems with
strong recurrence properties. It holds for example for irrational rotations
with $T(x) = x + \alpha \bmod 1$ with Diophantine $\alpha$, while it does not hold for
Liouville $\alpha$. For any irrational $\alpha$, we have $f_n = \frac{1}{n^{1-\delta}} \sum_{k=1}^n 1_{A_n}(T^k(x))$ near
1 for arbitrary large $n = q_l$, where $p_l/q_l$ is the periodic approximation of
$\alpha$. However, if the $q_l$ are sufficiently far apart, there are arbitrary large $n$,
where $f_n$ is bounded away from 1 and where $f_n$ does not converge to 1.


The theorem we have proved above belongs to the research area of geome- 
try of numbers. Mixed with probability theory it is a result in the random 
geometry of numbers. 


A prototype of many results in the geometry of numbers is Minkowski’s 
theorem: 


Theorem 5.9.6 (Minkowski theorem). A convex set $M$ which is invariant
under the map $T(x) = -x$ and with area $> 4$ contains a lattice point
different from the origin.


Proof. One can translate all points of the set $M$ back to the square $\Omega =
[-1,1] \times [-1,1]$. Because the area is $> 4$, there are two different points
$(x,y), (u,v)$ which have the same identification in the square $\Omega$. But if
$(x,y) = (u + 2k, v + 2l)$ then $(x - u, y - v) = (2k, 2l)$. By point symmetry also
$(a,b) = (-u,-v)$ is in the set $M$. By convexity $((x+a)/2, (y+b)/2) = (k,l)$
is in $M$. This is the lattice point we were looking for. □

Figure. A convex, symmetric set
$M$. For illustration purposes, the
area has been chosen smaller
than 4 in this picture. The theorem
of Minkowski assumes it is
larger than 4.


Figure. Translate all points back
to the square $[-1,1] \times [-1,1]$ of
area 4. One obtains overlapping
points. The symmetry and convexity
allow to conclude the existence
of a lattice point in $M$.


There are also open questions: 


• The Gauss circle problem asks to estimate the number of $1/n$-lattice
points $g(n) = \pi n^2 + E(n)$ enclosed in the unit disk. One believes that
an estimate $E(n) \leq C n^{\theta}$ holds for every $\theta > 1/2$. The smallest $\theta$ for
which the estimate is known is $\theta = 46/73$.


• For a smooth curve of length 1 which is not a line, we have a similar
result as for the random walk but we need $\delta < 1/3$. Is there a result
for $\delta < 1$?


• If we look at Brownian motion in $\mathbb{R}^d$: how many $1/n$-lattice points
are there in a Wiener sausage, in a $1/n^{1+\delta}$ neighborhood of the path?
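
The counting function in the Gauss circle problem is easy to explore numerically. The following Python sketch is an illustration only; the function names are chosen here and are not from the text. It counts the integer points $(i,j)$ with $i^2 + j^2 \leq n^2$, which is the same as counting the points of the lattice $(1/n)\mathbb{Z}^2$ in the closed unit disk, and prints the error $E(n) = g(n) - \pi n^2$.

```python
import math

def lattice_points_in_disk(n: int) -> int:
    """Count integer points (i, j) with i^2 + j^2 <= n^2.

    This equals the number of points of the lattice (1/n) Z^2
    inside the closed unit disk.
    """
    count = 0
    for i in range(-n, n + 1):
        # Largest j >= 0 with i^2 + j^2 <= n^2; the column contains 2*jmax + 1 points.
        jmax = math.isqrt(n * n - i * i)
        count += 2 * jmax + 1
    return count

def gauss_error(n: int) -> float:
    """E(n) = g(n) - pi n^2."""
    return lattice_points_in_disk(n) - math.pi * n * n

for n in (10, 100, 1000):
    print(n, lattice_points_in_disk(n), round(gauss_error(n), 3))
```

Comparing the printed values of $E(n)$ with $n^{\theta}$ for different $\theta$ gives a feeling for the conjectured exponent, but of course no proof.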


5.10 Arithmetic random variables 


Because large numbers are virtually infinite - we have no possibility to
inspect all of the numbers from $\Omega_n = \{1, \dots, n = 10^{100}\}$ for example -
functions like $X_n(k) = k^2 + 5 \bmod n$ are accessible on a small subset only. The
function $X_n$ behaves like a random variable on an infinite probability space. If
we could find the events $U_n = \{X_n = 0\}$ easily, then factorization would
be easy, as the factors of $n$ can be determined from $U_n$. A finite but large
probability space $\Omega_n$ can be explored statistically and the question is how
much information we can draw from a small number of data. It is unknown
how much information we can get from a large integer $n$ with finitely many
computations. Can we statistically recover the factors of $n$ from $O(\log(n))$
data points $(k_i, x_i)$, where $x_i = n \bmod k_i$ for example?


As an illustration of how arithmetic complexity meets randomness, we consider
in this section examples of number theoretical random variables, which
can be computed with a fixed number of arithmetic operations. Both have
the property that they appear to be "random" for large $n$. These functions
belong to a class of random variables


\[ X(k) = p(k,n) \bmod q(k,n) , \]


where $p$ and $q$ are polynomials in two variables. For these functions, the
sets $X^{-1}(a) = \{X(k) = a\}$ are in general difficult to compute and
$Y_0(k) = X(k), Y_1(k) = X(k+1), \dots, Y_l(k) = X(k+l)$ behave very much
as independent random variables.


To deal with "number theoretical randomness", we use the notion of asymptotic
independence. Asymptotically independent random variables approximate
independent random variables in the limit $n \to \infty$. With this
notion, we can study fixed sequences or deterministic arithmetic functions
on finite probability spaces with the language of probability, even though there is
no fixed probability space on which the sequences form a stochastic process.


Definition. A sequence of number theoretical random variables is a collection
of integer valued random variables $X_n$ defined on finite probability
spaces $(\Omega_n, \mathcal{A}_n, P_n)$ for which $\Omega_n \subset \Omega_{n+1}$ and $\mathcal{A}_n$ is the set of all subsets
of $\Omega_n$. An example is a sequence $X_n$ of integer valued functions defined
on $\Omega_n = \{0, \dots, n-1\}$. If there exists a constant $C$ such that $X_n$ on
$\{0, \dots, n\}$ is computable with a total of less than $C$ additions, multiplications,
comparisons, greatest common divisor and modular operations, we
call $X$ a sequence of arithmetic random variables.


Example. For example
\[ X_n(x) = (((x^5 - 7) \bmod 9)^3 x - x^2) \bmod n \]
defines a sequence of arithmetic random variables on $\Omega_n = \{0, \dots, n-1\}$.


Example. If $x_n$ is a fixed integer sequence, then $X_n(k) = x_k$ on $\Omega_n =
\{0, \dots, n-1\}$ is a sequence of number theoretical random variables. For
example, the digits $x_n$ of the decimal expansion of $\pi$ define a sequence
of number theoretical random variables $X_n(k) = x_k$ for $k \leq n$. However,
in the case of $\pi$, it is not known whether this sequence is an arithmetic
sequence. It would be a surprise if one could compute $x_n$ with a finite
$n$-independent number of basic operations. Also other deterministic sequences
like the decimal expansions of $\pi$, $\sqrt{2}$ or the Möbius function $\mu(n)$ appear
"random".

Remark. Unlike for discrete time stochastic processes $X_n$, where all random
variables $X_n$ are defined on a fixed probability space $(\Omega, \mathcal{A}, P)$, an
arithmetic sequence of random variables $X_n$ uses different finite probability
spaces $(\Omega_n, \mathcal{A}_n, P_n)$.


Remark. Arithmetic functions are a subset of the complexity class P of
functions computable in polynomial time. The class of arithmetic sequences
of random variables is expected to be much smaller than the class of
sequences of all number theoretical random variables. Because computing
$\gcd(x,y)$ needs less than $C(x+y)$ basic operations, we have included it
too in the definition of arithmetic random variable.


Definition. If $\lim_{n \to \infty} E[X_n]$ exists, then it is called the asymptotic expectation
of a sequence of arithmetic random variables. If $\lim_{n \to \infty} \mathrm{Var}[X_n]$
exists, it is called the asymptotic variance. If the law of $X_n$ converges, the
limiting law is called the asymptotic law.


Example. On the probability space $\Omega_n = [1, \dots, n] \times [1, \dots, n]$, consider the
arithmetic random variables $X_d = 1_{S_d}$, where $S_d = \{(n,m) \mid \gcd(n,m) = d\}$.


Proposition 5.10.1. The asymptotic expectation of $P_n[S_1] = E_n[X_1]$ is $6/\pi^2$.
In other words, the probability that two random integers are relatively
prime is $6/\pi^2$.




Proof. Because there is a bijection $\phi$ between $S_1$ on $[1, \dots, n]^2$ and $S_d$ on
$[1, \dots, dn]^2$ realized by $\phi(j,k) = (dj, dk)$, the two sets have the same number
of elements, so that $P_{dn}[S_d] = |S_d|/(d^2 n^2) = |S_1|/(d^2 n^2) = P_n[S_1]/d^2$.
This shows that $E_n[X_d]/E_n[X_1]$ has the limit $1/d^2$ for $n \to \infty$. To
know $P[S_1]$, we note that the sets $S_d$ form a partition of $\mathbb{N}^2$ and also when
restricted to $\Omega_n$. Because $P[S_d] = P[S_1]/d^2$, one has


\[ P[S_1] \cdot \left( \frac{1}{1^2} + \frac{1}{2^2} + \frac{1}{3^2} + \cdots \right) = P[S_1] \cdot \frac{\pi^2}{6} = 1 , \]


so that $P[S_1] = 6/\pi^2$. □

Figure. The probability that two random integers are relatively
prime is $6/\pi^2$. A cell $(j,k)$ in the finite probability space
$[1, \dots, n] \times [1, \dots, n]$ is painted black if $\gcd(j,k) = 1$. The probability
that $\gcd(j,k) = 1$ is $6/\pi^2 = 0.607927...$ in the limit $n \to \infty$.
So, if you pick two large numbers $(j,k)$ at random, the chance
to have no common divisor is slightly larger than to have a
common divisor.
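
The proposition can be checked on the finite probability spaces themselves. The following Python sketch (the function name `gcd_class_fraction` is chosen here for illustration) counts the pairs in $\{1, \dots, n\}^2$ with a given greatest common divisor $d$ and compares the fractions with $6/(\pi^2 d^2)$.

```python
from math import gcd, pi

def gcd_class_fraction(n: int, d: int = 1) -> float:
    """Fraction of pairs (j, k) in {1..n}^2 with gcd(j, k) == d, i.e. P_n[S_d]."""
    hits = sum(1 for j in range(1, n + 1)
                 for k in range(1, n + 1) if gcd(j, k) == d)
    return hits / (n * n)

limit = 6 / pi ** 2
for n in (50, 100, 200):
    # P_n[S_1] should approach 6/pi^2 and P_n[S_2] should approach 6/(4 pi^2).
    print(n, gcd_class_fraction(n, 1), gcd_class_fraction(n, 2), limit, limit / 4)
```

The brute force count needs $n^2$ gcd computations, so it is only meant for small $n$.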


Exercise. Show that the asymptotic expectation of the arithmetic random
variable $X_n(x,y) = \gcd(x,y)$ on $\{1, \dots, n\}^2$ is infinite.


Example. A large class of arithmetic random variables is defined by
\[ X_n(k) = p(n,k) \bmod q(n,k) \]


on $\Omega_n = \{0, \dots, n-1\}$ where $p$ and $q$ are not simultaneously linear polynomials.
We will look more closely at the following two examples:


1) $X_n(k) = n^2 + c \bmod k$
2) $X_n(k) = k^2 + c \bmod n$


Definition. Two sequences $X, Y$ of arithmetic random variables (where
$X_n, Y_n$ are defined on the same probability spaces $\Omega_n$) are called uncorrelated
if $\mathrm{Cov}[X_n, Y_n] = 0$. They are called asymptotically uncorrelated, if
their asymptotic correlation is zero:


\[ \mathrm{Cov}[X_n, Y_n] \to 0 \]
for $n \to \infty$.


Definition. Two sequences $X, Y$ of arithmetic random variables are called
independent if for every $n$, the random variables $X_n, Y_n$ are independent.
Two sequences $X, Y$ of arithmetic random variables with values in $[0,n]$
are called asymptotically independent, if for all intervals $I, J$, we have


\[ P[\frac{X_n}{n} \in I, \frac{Y_n}{n} \in J] - P[\frac{X_n}{n} \in I] \cdot P[\frac{Y_n}{n} \in J] \to 0 \]


for $n \to \infty$.

Remark. If there exist two uncorrelated sequences of arithmetic random
variables $U, V$ such that $\|U_n - X_n\|_{L^2(\Omega_n)} \to 0$ and $\|V_n - Y_n\|_{L^2(\Omega_n)} \to 0$,
then $X, Y$ are asymptotically uncorrelated. If the same is true for independent
sequences $U, V$ of arithmetic random variables, then $X, Y$ are asymptotically
independent.


Remark. If two random variables are asymptotically independent, they are 
asymptotically uncorrelated. 


Example. Two arithmetic random variables $X_n(k) = k \bmod n$ and $Y_n(k) =
ak + b \bmod n$ are not asymptotically independent. Let us look at the distribution
of the random vector $(X_n, Y_n)$ in an example:


Figure. The figure shows 
the points (Xn(k),Yn(k)) for 
Xn(k) = k,Yn(k) = 5k +3 
modulo n in the case n = 2000. 
There is a clear correlation be- 
tween the two random variables. 


Exercise. Find the correlation of $X_n(k) = k \bmod n$ and $Y_n(k) = 5k +
3 \bmod n$.


Having asymptotic correlations between sequences of arithmetic random
variables is rather exceptional. Most of the time, we observe asymptotic
independence. Here are some examples:


Example. Consider the two arithmetic variables $X_n(k) = k$ and

\[ Y_n(k) = c k^{-1} \bmod p(n) , \]
where $c$ is a constant and $p(n)$ is the $n$'th prime number. The random
variables $X_n$ and $Y_n$ are asymptotically independent. Proof: by a lemma of
Merel [67, 23], the number of solutions $(x,y) \in I \times J$ of $xy = c \bmod p$ is


\[ \frac{|I| \cdot |J|}{p} + O(p^{1/2} \log^2(p)) . \]


This means that the probability that $X_n/n \in I_n$, $Y_n/n \in J_n$ is asymptotically $|I_n| \cdot |J_n|$.

Figure. Illustration of the lemma of Merel. The picture shows the
points $\{(k, 1/k) \bmod p\}$, where $p$ is the 200'th prime number
$p(200) = 1223$.
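
The point set of the figure can be generated directly. The sketch below is a hedged illustration: the helper names `inverse_points` and `box_count` are made up here, and the boxes $I, J$ are arbitrary choices. It computes the points $(k, c k^{-1} \bmod p)$ and compares the number of points in a box $I \times J$ with the main term $|I| \cdot |J|/p$ of the lemma.

```python
def inverse_points(p: int, c: int = 1):
    """Points (k, c * k^{-1} mod p) for k = 1..p-1, where p is assumed to be prime."""
    # pow(k, -1, p) computes the modular inverse (available since Python 3.8).
    return [(k, (c * pow(k, -1, p)) % p) for k in range(1, p)]

def box_count(points, I: range, J: range) -> int:
    """Number of points (x, y) with x in I and y in J."""
    return sum(1 for (x, y) in points if x in I and y in J)

p = 1223                      # the 200'th prime, as in the figure
points = inverse_points(p)
I, J = range(1, 400), range(600, 1000)
print("observed :", box_count(points, I, J))
print("main term:", len(I) * len(J) / p)
```

For a prime as small as 1223 the error term $O(p^{1/2} \log^2(p))$ is not yet small compared to the main term, so the comparison is only indicative.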


Nonlinear polynomial arithmetic random variables lead in general to asymptotic
independence. Let us start with an experiment:


Figure. We see the points $(X_n(k), Y_n(k))$ for $X_n(k) = k$, $Y_n(k) = k^2 + 3$
modulo $n$ in the case $n = 2001$. Even though there are narrow regions in which some
correlations are visible, these regions become smaller and
smaller for $n \to \infty$. Indeed, we will show that $X_n, Y_n$ are
asymptotically independent random variables.
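
The two experiments, the linear map $k \mapsto 5k + 3 \bmod n$ and the quadratic map $k \mapsto k^2 + 3 \bmod n$, can be compared numerically. The following Python sketch (with a small, self-written `correlation` helper) computes the empirical correlation coefficient of $X_n$ and $Y_n$ on $\{0, \dots, n-1\}$ with the uniform measure.

```python
import math

def correlation(xs, ys) -> float:
    """Correlation coefficient of two equally long lists under the uniform measure."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    return cov / math.sqrt(var_x * var_y)

def linear_pair(n: int):
    ks = list(range(n))
    return ks, [(5 * k + 3) % n for k in ks]

def quadratic_pair(n: int):
    ks = list(range(n))
    return ks, [(k * k + 3) % n for k in ks]

print("linear,    n = 2000:", correlation(*linear_pair(2000)))
print("quadratic, n = 2001:", correlation(*quadratic_pair(2001)))
```

A vanishing correlation is only a necessary condition for asymptotic independence, so the second number supports the claim in the figure but does not prove it.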


The random variable $X_n(k) = (n^2 + c) \bmod k$ on $\{1, \dots, n\}$ is equivalent
to $X_n(k) = n \bmod k$ on $\{0, \dots, [\sqrt{n - c}]\}$, where $[x]$ is the integer part of
$x$. After the rescaling, the sequence of random variables is easier to analyze.


To study the distribution of the arithmetic random variable $X_n$, we can
also rescale the image, so that the range is the interval $[0,1]$. The random
variable $Y_n = X_n(x \cdot |\Omega_n|)$ can be extended from the discrete set $\{k/|\Omega_n|\}$
to the interval $[0,1]$. Therefore, instead of $n^2 + c \bmod k$, we look at


\[ X_n(k) = \frac{n \bmod k}{k} = \frac{n}{k} - \left[ \frac{n}{k} \right] \]


on $\Omega_{m(n)} = \{1, \dots, m(n)\}$, where $m(n) = \sqrt{n - c}$.


Elements in the set $X^{-1}(0)$ are the integer factors of $n$. Because factoring
is a well studied NP problem, the multi-valued function $X^{-1}$ is probably
hard to compute in general: if we could compute it fast, we could
factor integers fast.

Proposition 5.10.2. The rescaled arithmetic random variables


\[ X_n(k) = \frac{n \bmod k}{k} \]


converge in law to the uniform distribution on $[0,1]$.


Proof. The functions $f_n^r(k) = n/(k+r) - [n/(k+r)]$ are piecewise continuous
circle maps on $[0,1]$. When rescaling the argument $[0, \dots, n]$, the slope of
the graph becomes larger and larger for $n \to \infty$. We can use lemma (5.10.3)
below. □
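
A quick experiment illustrates the proposition. The sketch below is an illustration under stated assumptions: $c = 0$, the range $1 \leq k \leq [\sqrt{n}]$ as in the definition of $m(n)$, and a bin count chosen here. It computes the rescaled remainders $(n \bmod k)/k$ and prints a histogram together with the empirical mean, which should be close to $1/2$ for a uniform distribution.

```python
import math

def rescaled_remainders(n: int, c: int = 0):
    """Values (n mod k)/k for k = 1..m(n), where m(n) is the integer part of sqrt(n - c)."""
    m = math.isqrt(n - c)
    return [(n % k) / k for k in range(1, m + 1)]

def histogram(values, bins: int = 10):
    """Counts of values in [0, 1) falling into equally sized bins."""
    counts = [0] * bins
    for v in values:
        counts[min(int(v * bins), bins - 1)] += 1
    return counts

n = 10 ** 6 + 3
values = rescaled_remainders(n)
print("bin counts:", histogram(values))
print("mean      :", sum(values) / len(values))
```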


Figure. Data points $(k, \frac{n \bmod k}{k})$ for $n = 10'000$ and $1 \leq k \leq n$.
For smaller values of $k$, the data points appear random. The
points are located on the graph of the circle map
$f_n(t) = \frac{n}{t} - [\frac{n}{t}]$.


To show the asymptotic independence of $X_n$ with any of its translations,
we restrict the random vectors to $[1, n^a]$ with $a < 1$.


Lemma 5.10.3. Let $f_n$ be a sequence of smooth maps from $[0,1]$ to the circle
$T^1 = \mathbb{R}/\mathbb{Z}$ for which $(f_n^{-1})''(x) \to 0$ uniformly on $[0,1]$. Then the law $\mu_n$ of
the random variables $X_n(x) = (x, f_n(x))$ converges weakly to the Lebesgue
measure $\mu = dx\,dy$ on $[0,1] \times T^1$.


Proof. Fix an interval $[a,b]$ in $[0,1]$. Because $\mu_n([a,b] \times T^1)$ is the Lebesgue
measure of $\{x \mid X_n(x) \in [a,b] \times T^1\}$, which is equal to $b - a$, we only need
to compare

\[ \mu_n([a,b] \times [c, c+dy]) \]


and


\[ \mu_n([a,b] \times [d, d+dy]) \]
in the limit $n \to \infty$. But $\mu_n([a,b] \times [c, c+dy]) - \mu_n([a,b] \times [d, d+dy])$ is
bounded above by


\[ |(f_n^{-1})'(c) - (f_n^{-1})'(d)| \leq |(f_n^{-1})''(x)| \]


for some $x$ between $c$ and $d$, which goes to zero by assumption. □


Figure. Proof of the lemma. The measure $\mu_n$ with support on the
graph of $f_n(x)$ converges to the Lebesgue measure on the product
space $[0,1] \times T^1$. The condition $f''/f'^2 \to 0$ assures that
the distribution in the $y$ direction smooths out.


Theorem 5.10.4. Let $c$ be a fixed integer and $X_n(k) = (n^2 + c) \bmod k$
on $\{1, \dots, n\}$. For every integer $r > 0$ and $0 < a < 1$, the random variables
$X(k), Y(k) = X(k+r)$ are asymptotically independent and uncorrelated
on $[0, n^a]$.


Proof. We have to show that the discrete measures $\sum_{k=1}^{n^a} \delta(X(k), Y(k))$,
normalized to probability measures,
converge weakly to the Lebesgue measure on the torus. To do so, we first
look at the measure $\mu_n = \int_0^{n^a} \delta(X(t), Y(t)) \, dt$ which is supported on
the curve $t \mapsto (X(t), Y(t))$, where $t \in [0, n^a]$ with $a < 1$, and which converges weakly
to the Lebesgue measure. When rescaled, this curve is the graph of the
circle map $f_n(x) = 1/x \bmod 1$. The result follows from lemma (5.10.3). □


Remark. Similarly, we could show that the random vectors $(X(k), X(k +
r_1), X(k + r_2), \dots, X(k + r_l))$ are asymptotically independent.


Remark. Polynomial maps like $T(x) = x^2 + c$ are used as pseudo random
number generators, for example in the Pollard $\rho$ method for factorization
[84]. In that case, one considers the random variables on $\{0, \dots, n-1\}$
defined by $X_0(k) = k$, $X_{n+1}(k) = T(X_n(k))$. Already one polynomial map
produces randomness asymptotically as $n \to \infty$.
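
To make the remark concrete, here is a minimal sketch of the Pollard $\rho$ method in Python, using Floyd's cycle detection. It is one common textbook variant and not necessarily the formulation of [84]; the starting value, the constants $c$ and the retry loop in `find_factor` are choices made here.

```python
from math import gcd

def pollard_rho(n: int, c: int = 1, x0: int = 2) -> int:
    """Iterate T(x) = x^2 + c mod n and return gcd(|x - y|, n) once it exceeds 1.

    The result is a nontrivial factor of n, or n itself if this choice of c fails.
    """
    x = y = x0
    d = 1
    while d == 1:
        x = (x * x + c) % n      # slow sequence: one application of T
        y = (y * y + c) % n      # fast sequence: two applications of T
        y = (y * y + c) % n
        d = gcd(abs(x - y), n)
    return d

def find_factor(n: int):
    """Try a few constants c; return a nontrivial factor of an odd composite n, or None."""
    for c in range(1, 20):
        d = pollard_rho(n, c)
        if 1 < d < n:
            return d
    return None

print(find_factor(8051))   # 8051 = 83 * 97
```

The method relies on the heuristic that the orbit of $T$ modulo an unknown prime factor $p$ of $n$ behaves like a random sequence, so that a collision modulo $p$ is expected after about $\sqrt{p}$ steps.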
Theorem 5.10.5. If $p$ is a polynomial of degree $d \geq 2$, then the distribution
of $Y(k) = p(k) \bmod n$ is asymptotically uniform. The random variables
$X(k) = k$ and $Y(k) = p(k) \bmod n$ are asymptotically independent and
uncorrelated.


Proof. The map can be extended to a map on the interval $[0,n]$. The graph
$(x, T(x))$ in $\{1, \dots, n\} \times \{1, \dots, n\}$ has a large slope on most of the square.
Again use lemma (5.10.3) for the circle maps $f_n(x) = p(nx) \bmod n$ on
$[0,1]$. □


Figure. The slope of the graph of $p(x) \bmod n$ becomes larger
and larger as $n \to \infty$. Choosing an integer $k \in [0,n]$ produces
essentially a random value $p(k) \bmod n$. To prove the asymptotic
independence, one has to verify that in the limit, the push
forward of the Lebesgue measure on $[0,n]$ under the map $f(x) =
(x, p(x)) \bmod n$ converges in law to the Lebesgue measure on
$[0,n]^2$.


Remark. Also here, we deal with random variables which are difficult to
invert: if one could find $Y^{-1}(c)$ in $O(P(\log(n)))$ time steps for some polynomial $P$, then factorization
would be in the complexity class P of tasks which can be computed
in polynomial time. The reason is that taking square roots modulo $n$ is at
least as hard as factoring: if we could find two square roots
$x, y$ of a number modulo $n$ with $x \neq \pm y \bmod n$, then $x^2 = y^2 \bmod n$, and
$\gcd(x - y, n)$ is a nontrivial factor of $n$. This fact had already been known to Fermat. If
factorization was an NP complete problem, then inverting those maps would
be hard.
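
The step from two square roots to a factor is short enough to spell out. In the following Python sketch, the function name and the numerical example $n = 8051$ are chosen here for illustration: $90^2 = 8100$ and $7^2 = 49$ differ by $8051$, so $90$ and $7$ are square roots of the same number modulo $n$.

```python
from math import gcd

def factor_from_square_roots(x: int, y: int, n: int) -> int:
    """Given x^2 = y^2 mod n with x != +-y mod n, return the nontrivial factor gcd(x - y, n)."""
    if (x * x - y * y) % n != 0:
        raise ValueError("x and y are not square roots of the same number mod n")
    if (x - y) % n == 0 or (x + y) % n == 0:
        raise ValueError("x = +-y mod n gives only a trivial factor")
    return gcd(x - y, n)

n = 8051
d = factor_from_square_roots(90, 7, n)
print(d, n // d)   # the two factors of 8051
```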


Remark. The Möbius function is a function on the positive integers defined
as follows: the value of $\mu(n)$ is defined as 0, if $n$ has a factor $p^2$ with a prime $p$
and is $(-1)^k$, if it contains $k$ distinct prime factors. For example, $\mu(14) = 1$
and $\mu(18) = 0$ and $\mu(30) = -1$. The Mertens conjecture claimed that


\[ M(n) = |\mu(1) + \cdots + \mu(n)| \leq C \sqrt{n} \]


for some constant $C$. It is now believed that $M(n)/\sqrt{n}$ is unbounded but it
is hard to explore this numerically, because the $\sqrt{\log\log(n)}$ bound in the
law of iterated logarithm is small for the integers $n$ we are able to compute
- for example for $n = 10^{100}$, one has that $\sqrt{\log\log(n)}$ is less than 8/3. The fact


\[ \frac{M(n)}{n} = \frac{1}{n} \Big| \sum_{k=1}^{n} \mu(k) \Big| \to 0 \]


is known to be equivalent to the prime number theorem. It is also known
that, for the sums taken without the absolute value,
$\limsup M(n)/\sqrt{n} \geq 1.06$ and $\liminf M(n)/\sqrt{n} \leq -1.009$.

If one restricts the function $\mu$ to the finite probability spaces $\Omega_n$ of all
numbers $\leq n$ which have no repeated prime factors, one obtains a sequence
of number theoretical random variables $X_n$, which take values in $\{-1, 1\}$.
Is this sequence asymptotically independent? Is the sequence $\mu(n)$ random
enough so that the law of the iterated logarithm


\[ \limsup_{n \to \infty} \frac{\sum_{k=1}^{n} X_k}{\sqrt{2n \log\log(n)}} \leq 1 \]


holds? Nobody knows. The question is probably very hard, because if it
were true, one would have


\[ M(n) \leq n^{1/2+\epsilon}, \mbox{ for all } \epsilon > 0 \]


which is called the modified Mertens conjecture. This conjecture is known
to be equivalent to the Riemann hypothesis, probably the most notorious
unsolved problem in mathematics. In any case, the connection with
the Möbius function produces a convenient way to formulate the Riemann
hypothesis to non-mathematicians (see for example [14]). Actually,
the question about the randomness of $\mu(n)$ appeared in classic probability
text books like Feller's. Why would the law of the iterated logarithm for
the Möbius function imply the Riemann hypothesis? Here is a sketch of
the argument: the Euler product formula - sometimes referred to as "the
Golden key" - says


\[ \zeta(s) = \sum_{n=1}^{\infty} \frac{1}{n^s} = \prod_{p \; {\rm prime}} \left( 1 - \frac{1}{p^s} \right)^{-1} . \]
n=1 p prime 


The function $\zeta(s)$ in the above formula is called the Riemann zeta function.
With $M(n) \leq n^{1/2+\epsilon}$, one can conclude from the formula

\[ \frac{1}{\zeta(s)} = \sum_{n=1}^{\infty} \frac{\mu(n)}{n^s} \]

that $1/\zeta(s)$ could be extended analytically from $\mathrm{Re}(s) > 1$ to any of the
half planes $\mathrm{Re}(s) > 1/2 + \epsilon$. This would prevent roots of $\zeta(s)$ to be to the
right of the axis $\mathrm{Re}(s) = 1/2$. By a result of Riemann, the function $\Lambda(s) =
\pi^{-s/2} \Gamma(s/2) \zeta(s)$ is a meromorphic function with simple poles at $s = 0$ and $s = 1$ and
satisfies the functional equation $\Lambda(s) = \Lambda(1 - s)$. This would imply that
$\zeta(s)$ has also no nontrivial zeros to the left of the axis $\mathrm{Re}(s) = 1/2$ and
that the Riemann hypothesis were proven. The upshot is that the Riemann
hypothesis could have aspects which are rooted in probability theory.
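
The Möbius function and its partial sums are easy to tabulate for moderate $n$. The following Python sketch computes $\mu$ with a simple sieve and prints the signed sums $\mu(1) + \cdots + \mu(n)$ together with their ratio to $\sqrt{n}$; the function names are chosen here. Note that the code keeps the sign of the sum, while $M(n)$ above is defined with an absolute value.

```python
import math

def mobius_sieve(N: int):
    """Return a list mu with mu[n] = Moebius function of n for 1 <= n <= N (mu[0] is unused)."""
    mu = [1] * (N + 1)
    mu[0] = 0
    is_prime = [True] * (N + 1)
    for p in range(2, N + 1):
        if is_prime[p]:
            for multiple in range(2 * p, N + 1, p):
                is_prime[multiple] = False
            for multiple in range(p, N + 1, p):
                mu[multiple] *= -1          # one more distinct prime factor
            for multiple in range(p * p, N + 1, p * p):
                mu[multiple] = 0            # repeated prime factor
    return mu

def mertens_sums(N: int):
    """Signed partial sums S[n] = mu(1) + ... + mu(n) for 0 <= n <= N."""
    mu = mobius_sieve(N)
    sums = [0] * (N + 1)
    for n in range(1, N + 1):
        sums[n] = sums[n - 1] + mu[n]
    return sums

S = mertens_sums(10000)
for n in (10, 100, 1000, 10000):
    print(n, S[n], S[n] / math.sqrt(n))
```

In this small range the ratio stays well inside $[-1, 1]$, which illustrates why the failure of the Mertens conjecture cannot be seen in numerical experiments of this size.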


Figure. The sequence $X_k = \mu(l(k))$, where $l(k)$ is the index of the $k$'th
nonzero entry in the sequence $\{\mu(1), \mu(2), \mu(3), \dots\}$, produces a
"random walk" $S_n = \sum_{k=1}^{n} X_k$.
While $X_k$ is a deterministic sequence, the behavior of $S_n$ resembles
a typical random walk. If that were true and the law of
the iterated logarithm would hold, this would imply the Riemann
hypothesis.


5.11 Symmetric Diophantine Equations 


Definition. A Diophantine equation is an equation $f(x_1, \dots, x_k) = 0$, where
$f$ is a polynomial in $k$ integer variables $x_1, \dots, x_k$ and where the polynomial
$f$ has integer coefficients. The Diophantine equation has degree $m$ if the
polynomial has degree $m$. The Diophantine equation is homogeneous, if
every summand in the polynomial has the same degree. A homogeneous
Diophantine equation is also called a form.


Example. The quadratic equation $x^2 + y^2 - z^2 = 0$ is a homogeneous
Diophantine equation of degree 2. It has many solutions. They are called
Pythagorean triples. One can parameterize them all with two parameters
$s, t$ with $x = 2st$, $y = s^2 - t^2$, $z = s^2 + t^2$, as has been known since antiquity
already [15].


Definition. A Diophantine equation of the form


\[ p(x_1, \dots, x_k) = p(y_1, \dots, y_k) \]


is called a symmetric Diophantine equation. More generally, a Diophantine
equation

\[ x_1^m + \cdots + x_k^m = y_1^m + \cdots + y_l^m \]

is called an Euler Diophantine equation of type $(k,l)$ and degree $m$. It is a
symmetric Diophantine equation if $k = l$. [28, 35, 15, 4, 5]


Remark. An Euler Diophantine equation is equivalent to a symmetric Diophantine
equation if $m$ is odd and $k + l$ is even.

Definition. A solution $(x_1, \dots, x_k), (y_1, \dots, y_k)$ to a symmetric Diophantine
equation $p(x) = p(y)$ is called nontrivial, if $\{x_1, \dots, x_k\}$ and $\{y_1, \dots, y_k\}$
are different sets. For example, $5^3 + 7^3 + 3^3 = 3^3 + 7^3 + 5^3$ is a trivial
solution of $p(x) = p(y)$ with $p(x,y,z) = x^3 + y^3 + z^3$.

The following theorem was proved in [68]:


Theorem 5.11.1 (Jaroslaw Wroblewski 2002). For $k > m$, the Diophantine
equation $x_1^m + \cdots + x_k^m = y_1^m + \cdots + y_k^m$ has infinitely many nontrivial
solutions.


Proof. Let $R$ be the collection of different integer multi-sets in the finite
set $[0, \dots, n]^k$. It contains at least $n^k/k!$ elements. The values in the set $S = \{p(x) =
x_1^m + \cdots + x_k^m \mid x \in R\}$ are integers in the interval $[0, k n^m]$.
By the pigeon hole principle, there are different multi-sets $x, y$ for which
$p(x) = p(y)$ as soon as there are more multi-sets than possible values. This is
the case if $n^k/k! > k n^m + 1$, which holds for large $n$ because $k > m$. □
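
The pigeon hole argument can be turned into a brute force search. The sketch below uses a hypothetical helper `power_sum_collisions`, written for this illustration. It groups all multi-sets in $\{0, \dots, n\}^k$ by the value of $x_1^m + \cdots + x_k^m$ and reports the values which are hit more than once, that is, the nontrivial solutions in this range.

```python
from collections import defaultdict
from itertools import combinations_with_replacement

def power_sum_collisions(k: int, m: int, n: int):
    """Map each value v to the list of multi-sets x in {0..n}^k with sum of x_i^m == v,
    keeping only values attained by at least two different multi-sets."""
    buckets = defaultdict(list)
    # combinations_with_replacement enumerates every multi-set exactly once (as a sorted tuple).
    for x in combinations_with_replacement(range(n + 1), k):
        buckets[sum(t ** m for t in x)].append(x)
    return {v: xs for v, xs in buckets.items() if len(xs) > 1}

# k = 3 > m = 2: collisions are forced already for small n.
print(sorted(power_sum_collisions(3, 2, 5).items())[:3])

# k = 2 < m = 3 is not covered by the theorem, but a collision still exists.
print(power_sum_collisions(2, 3, 12))
```

The second call finds the smallest number which is a sum of two cubes in two different ways, a case with $k < m$ which "beats" the counting argument.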


The proof generalizes to the case, where $p$ is an arbitrary polynomial of
degree $m$ with integer coefficients in the variables $x_1, \dots, x_k$.


Theorem 5.11.2. For an arbitrary polynomial $p$ in $k$ variables of degree
$m$ with $k > m$, the Diophantine equation $p(x) = p(y)$ has infinitely many nontrivial
solutions.


Remark. Already small deviations from the symmetric case lead to local
constraints: for example, $2p(x) = 2p(y) + 1$ has no solution for any nonzero
polynomial $p$ in $k$ variables because there are no solutions modulo 2.


Remark. It has been realized by Jean-Charles Meyrignac, that the proof
also gives nontrivial solutions to simultaneous equations like $p(x) = p(y) =
p(z)$ etc. again by the pigeon hole principle: there are some slots, where more
than 2 values hit. Hardy and Wright [28] (theorem 412) prove that in the
case $k = 2, m = 3$: for every $r$, there are numbers which are representable
as sums of two positive cubes in at least $r$ different ways. No solutions
of $x_1^4 + y_1^4 = x_2^4 + y_2^4 = x_3^4 + y_3^4$ were known to those authors [28], nor
whether there are infinitely many solutions for general $(k,m) = (2,m)$.
Mahler proved that $x^3 + y^3 + z^3 = 1$ has infinitely many solutions. It is
believed that $x^3 + y^3 + z^3 + w^3 = n$ has solutions for all $n$. For $(k,m) = (2,3)$,
multiple solutions lead to so called taxi-cab or Hardy-Ramanujan numbers.


Remark. For general polynomials, the degree and number of variables alone
do not decide about the existence of nontrivial solutions of p(x_1,...,x_k) =
p(y_1,...,y_k). There are symmetric irreducible homogeneous equations with
k < m/2 for which one has a nontrivial solution. An example is p(x,y) =
x — 4y° which has the nontrivial solution p(1,3) = p(4, 5). 


Definition. The law of a symmetric Diophantine equation $p(x_1, \dots, x_k) =
p(y_1, \dots, y_k)$ with domain $\Omega = \{0, \dots, n\}^k$ is the law of the random variable
$p$ defined on the finite probability space $\Omega$.


Remark. Wroblewski's theorem holds because the random variable has an
average density which is larger than the lattice spacing of the integers. So,
there have to be different integers, which match. The continuum analog is
that if a random variable $X$ on a domain $\Omega$ takes values in $[a,b]$ and $b - a$
is smaller than the area of $\Omega$, then the density $f_X$ is larger than 1 at some
point.


Remark. Wroblewski's theorem covers cases like $x^2 + y^2 + z^2 = u^2 + v^2 + w^2$
or $x^3 + y^3 + z^3 + w^3 = a^3 + b^3 + c^3 + d^3$. It is believed that for $k > m/2$,
there are infinitely many solutions and no solution for $k < m/2$ [59].


Remark. For homogeneous Diophantine equations, it is enough to find a
single nontrivial solution $(x_1, \dots, x_k), (y_1, \dots, y_k)$ to obtain infinitely many. The reason
is that $(c x_1, \dots, c x_k), (c y_1, \dots, c y_k)$ is a solution too, for any integer $c \neq 0$.

Here are examples of solutions. Sources are [69, 35, 15]:


k=2, m=4: $(59, 158)^4 = (133, 134)^4$ (Euler, gave algebraic solutions in 1772 and 1778)
k=2, m=5: open problem ([35]); all sums $\leq 1.02 \cdot 10^{26}$ have been tested
k=3, m=5: $(3, 54, 62)^5 = (24, 28, 67)^5$ ([59], two parametric solutions by Moessner 1939, Swinnerton-Dyer)
k=3, m=6: $(3, 19, 22)^6 = (10, 15, 23)^6$ ([28], Subba Rao, Bremner and Brudno parametric solutions)
k=3, m=7: open problem?
k=4, m=7: $(10, 14, 123, 149)^7 = (15, 90, 129, 146)^7$ (Ekl)
k=4, m=8: open problem?
k=5, m=7: $(8, 13, 16, 19)^7 = (2, 12, 15, 17, 18)^7$ ([59])
k=5, m=8: $(1, 10, 11, 20, 43)^8 = (5, 28, 32, 35, 41)^8$
k=5, m=9: $(192, 101, 91, 30, 26)^9 = (180, 175, 116, 17, 12)^9$ (Randy Ekl, 1997)
k=5, m=10: open problem
k=6, m=3: $(3, 19, 22)^6 = (10, 15, 23)^6$ (Subba Rao [59])
k=6, m=10: $(95, 71, 32, 28, 25, 16)^{10} = (92, 85, 34, 34, 23, 5)^{10}$ (Randy Ekl, 1997)
k=6, m=11: open problem?
k=7, m=10: $(1, 8, 31, 32, 55, 61, 68)^{10} = (17, 20, 23, 44, 49, 64, 67)^{10}$ ([59])
k=7, m=12: $(99, 77, 74, 73, 73, 54, 30)^{12} = (95, 89, 88, 48, 42, 37, 3)^{12}$ (Greg Childers, 2000)
k=7, m=13: open problem?
k=8, m=11: $(67, 52, 51, 51, 39, 38, 35, 27)^{11} = (66, 60, 47, 36, 32, 30, 16, 7)^{11}$ (Nuutti Kuosa, 1999)
k=20, m=21: $(76, 74, 74, 64, 58, 50, 50, 48, 48, 45, 41, 32, 21, 20, 10, 9, 8, 6, 4, 4)^{21}$
  $= (77, 73, 70, 70, 67, 56, 47, 46, 38, 35, 29, 28, 25, 23, 16, 14, 11, 11, 3, 3)^{21}$ (Greg Childers, 2000)
k=22, m=22: $(85, 79, 78, 72, 68, 63, 61, 61, 60, 55, 43, 42, 41, 38, 36, 34, 30, 28, 24, 12, 11, 11)^{22}$
  $= (83, 82, 77, 77, 76, 71, 66, 65, 65, 58, 58, 54, 54, 51, 49, 48, 47, 26, 17, 14, 8, 6)^{22}$ (Greg Childers, 2000)

Figure. Known cases of $(k,m)$ with nontrivial solutions $\vec{x}, \vec{y}$
of symmetric Diophantine equations $g(\vec{x}) = g(\vec{y})$ with $g(\vec{x}) =
x_1^m + \cdots + x_k^m$. Wroblewski's theorem assures that for $k > m$, there
are solutions. The points above the diagonal beat Wroblewski's
theorem. The steep line $m = 2k$ is believed to be the threshold
for the existence of nontrivial solutions. Above this line, there
should be no solutions, below, there should be nontrivial solutions.


What happens in the case $k = m$? There is no general result known. The
problem has a probabilistic flavor because one can look at the distribution
of random variables in the limit $n \to \infty$:


Lemma 5.11.3. Given a polynomial $p(x_1, \dots, x_k)$ with integer coefficients
of degree $k$. The random variables


\[ X_n(x_1, \dots, x_k) = p(x_1, \dots, x_k)/n^k \]


on the finite probability spaces $\Omega_n = [0, \dots, n]^k$ converge in law to the
random variable $X(x_1, \dots, x_k) = p(x_1, \dots, x_k)$ on the probability space
$([0,1]^k, \mathcal{B}, P)$, where $\mathcal{B}$ is the Borel $\sigma$-algebra and $P$ is the Lebesgue measure.


Proof. Let $S_{a,b}(n)$ be the number of points $(x_1, \dots, x_k)$ satisfying


\[ p(x_1, \dots, x_k) \in [n^k a, n^k b] . \]


This means
\[ \frac{S_{a,b}(n)}{n^k} = F_n(b) - F_n(a) , \]


where $F_n$ is the distribution function of $X_n$. The result follows from the fact
that $F_n(b) - F_n(a) = S_{a,b}(n)/n^k$ is a Riemann sum approximation of the integral
$F(b) - F(a) = \int_{A_{a,b}} 1 \, dx$, where $A_{a,b} = \{x \in [0,1]^k \mid X(x_1, \dots, x_k) \in
(a,b)\}$. □
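
The Riemann sum argument can be watched at work in a small case. The following Python sketch is an illustration with names chosen here. It evaluates the distribution function $F_n(s)$ of $p(x)/n^k$ on the grid $\{0, \dots, n\}^k$ for $p(x,y) = x^2 + y^2$ and compares it with the area $\pi s/4$ of the quarter disk $\{x^2 + y^2 \leq s\}$ in $[0,1]^2$, which is the limiting value for $0 \leq s \leq 1$.

```python
import math
from itertools import product

def empirical_cdf(p, k: int, n: int, s: float) -> float:
    """F_n(s): fraction of grid points x in {0..n}^k with p(x) / n^k <= s."""
    total = (n + 1) ** k
    hits = sum(1 for x in product(range(n + 1), repeat=k) if p(*x) <= s * n ** k)
    return hits / total

def sum_of_squares(x: int, y: int) -> int:
    return x * x + y * y

for n in (20, 100, 200):
    # Compare with the continuum value pi * s / 4 at s = 0.5.
    print(n, empirical_cdf(sum_of_squares, 2, n, 0.5), math.pi * 0.5 / 4)
```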


Definition. Let us call the limiting distribution the distribution of the symmetric
Diophantine equation. By the lemma, it is clearly a piecewise smooth
function.

Example. For $k = 1$, we have $F(s) = P[X(x) \leq s] = P[x^m \leq s] = s^{1/m}$.
The distributions for $k = 2$ for $p(x,y) = x^2 + y^2$ and $p(x,y) = x^2 - y^2$
were plotted in the first part of these notes. The distribution function of
$p(x_1, x_2, \dots, x_k)$ is a $k$'th convolution product $F_k = F \star \cdots \star F$, where
$F(s) = O(s^{1/m})$ near $s = 0$. The asymptotic distribution of $p(x,y) = x^2 + y^2$
is bounded for all $m$. The asymptotic distribution of $p(x,y) = x^2 - y^2$
is unbounded near $s = 0$.

Proof. We have to understand the laws of the
random variables $X(x,y) = x^2 + y^2$ on $[0,1]^2$. We can see geometrically that
$(1/4) s^2 \leq F_X(s) \leq s^2$. The density is bounded. For $Y(x,y) = x^2 - y^2$, we
use polar coordinates: $F(s)$ is the area of $\{(r,\theta) \mid r^2 \cos(2\theta)/2 \leq s\}$. Integration shows
that $F(s) = C s^2 + f(s)$, where $f(s)$ grows logarithmically as $-\log(s)$. For
$m > 2$, the area of $\{x^m - y^m \leq s\}$ is piecewise differentiable and the derivative
stays bounded.


Remark. Let $p$ be a polynomial of $k$ variables of degree $k$. If the density
$f = F'$ of the asymptotic distribution is unbounded, then there are
nontrivial solutions to the symmetric Diophantine equation $p(x) = p(y)$.


Corollary 5.11.4. (Generalized Wroblewski) Wroblewski’s result extends to 
polynomials p of degree k for which at least one variable appears in a term 
of degree smaller than k. 


Proof. We can assume without loss of generality that the first variable
is the one with a smaller degree $m$. If the variable $x_1$ appears only in
terms of degree $k - 1$ or smaller, then the polynomial $p$ maps the finite
space $[0, n^{k/m}] \times [0,n]^{k-1}$ with $n^{k + k/m - 1} = n^{k + \epsilon}$ elements into the interval
$[\min(p), \max(p)] \subset [-C n^k, C n^k]$. Apply the pigeon hole principle. □


Example. Let us illustrate this in the case $p(x,y,z,w) = x^4 + y^4 + z^3 + w^4$.
Consider the finite probability space $\Omega_n = [0,n] \times [0,n] \times [0, n^{4/3}] \times [0,n]$
with $n^{4+1/3}$ elements. The polynomial maps $\Omega_n$ to the interval $[0, 4n^4]$. The pigeon
hole principle shows that there are matches.


Theorem 5.11.5. If the density $f_p$ of the random variable $p$ on a surface
$\Omega \subset [0,n]^k$ is larger than $k!$, then there are nontrivial solutions to $p(x) =
p(y)$.


In general, we try to find a subset $\Omega \subset [0,n]^k \subset \mathbb{R}^k$ which contains $n^{k-\delta}$
points and which is mapped by $X$ into $[0, n^{-\delta}]$. This includes surfaces, subsets
or points, where the density of $X$ is large. To decide about this, we
definitely have to know the density of $X$ on subsets. This works often because
the polynomials $p$ modulo some integer number $L$ do not cover all
the residue classes. Much of the research in this part of Diophantine
equations is devoted to finding such subsets and hopefully parameterizing all of
the solutions.


Figure. $X(x,y,z) = x^3 + y^3 + z^3$. Figure. $X(x,y,z) = x^3 + y^3 - z^3$.


Exercise. Show that there are infinitely many integers which can be written
in non trivially different ways as $x^4 + y^4 + z^4 - w^2$.


Remark. Here is a heuristic argument for the "rule of thumb" that the Euler
Diophantine equation $x_1^m + \cdots + x_k^m = x_0^m$ has infinitely many solutions for
$k \geq m$ and no solutions if $k < m$.


For given $n$, the finite probability space $\Omega = \{(x_1, \dots, x_k) \mid 0 \leq x_i < n^{1/m}\}$
contains $n^{k/m}$ different vectors $x = (x_1, \dots, x_k)$. Define the random variable


\[ X(x) = (x_1^m + \cdots + x_k^m)^{1/m} . \]


We expect that $X$ takes values $1/n^{k/m} = n^{-k/m}$ close to an integer for large
$n$ because $Y(x) = X(x) \bmod 1$ is expected to be uniformly distributed on
the interval $[0,1)$ as $n \to \infty$.


How close do two values $Y(x), Y(y)$ have to be, so that $Y(x) = Y(y)$?
Assume $X(x) = X(y) + \epsilon$. Then


\[ X(x)^m = X(y)^m + m \epsilon X(y)^{m-1} + O(\epsilon^2) \]
with integers $X(x)^m, X(y)^m$. If $m X(y)^{m-1} \epsilon < 1$, then it must be zero so that
$Y(x) = Y(y)$. With the expected $\epsilon = n^{-k/m}$ and $X(y)^{m-1} \leq C n^{(m-1)/m}$ we
see we should have solutions if $k > m - 1$ and none for $k < m - 1$. Cases
like $m = 3, k = 2$, the Fermat Diophantine equation


\[ x^3 + y^3 = z^3 \]
are tagged as threshold cases by this reasoning.


This argument has still to be made rigorous by showing that the distribution
of the points $X(x) \bmod 1$ is uniform enough, which amounts to
understanding a dynamical system with multidimensional time. We see nevertheless
that probabilistic thinking can help to bring order into the zoo
of Diophantine equations. Here are some known solutions, some written in
the Lander notation


\[ (x_1, \dots, x_k)^m = x_1^m + \cdots + x_k^m . \]


m = 2, k = 2: $x^2 + y^2 = z^2$, Pythagorean triples like $3^2 + 4^2 = 5^2$ (1900 BC).
m = 3, k = 2: $x^m + y^m = z^m$ impossible, by Fermat's theorem.
m = 3, k = 3: $x^3 + y^3 + u^3 = v^3$ derived from taxicab numbers, like $10^3 + 9^3 = 1^3 + 12^3$ (Viete 1591).
m = 4, k = 3: $2682440^4 + 15365639^4 + 18796760^4 = 20615673^4$ (Elkies 1988 [24])
m = 5, k = 3: like $x^5 + y^5 + z^5 = w^5$ is open.
m = 4, k = 4: $30^4 + 120^4 + 272^4 + 315^4 = 353^4$ (R. Norrie 1911 [35])
m = 5, k = 4: $27^5 + 84^5 + 110^5 + 133^5 = 144^5$ (Lander Parkin 1967).
m = 6, k = 5: $x^6 + y^6 + z^6 + u^6 + v^6 = w^6$ is open.
m = 6, k = 6: $(74, 234, 402, 474, 702, 894, 1077)^6 = 1141^6$.
m = 7, k = 7: $(525, 439, 430, 413, 266, 258, 127)^7 = 568^7$ (Mark Dodrill, 1999)
m = 8, k = 8: $(1324, 1190, 1088, 748, 524, 478, 223, 90)^8 = 1409^8$ (Scott Chase)
m = 9, k = 12: $(91, 91, 89, 71, 68, 65, 43, 42, 19, 16, 13, 5)^9 = 103^9$ (Jean-Charles Meyrignac, 1997)
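
Since Python integers have arbitrary precision, identities written in the Lander notation can be verified directly. The following sketch uses the helper names `lander` and `check`, chosen here. It tests a few of the classical entries; the selection of identities is an illustration and not a complete check of the tables.

```python
def lander(xs, m: int) -> int:
    """Lander notation: (x_1, ..., x_k)^m = x_1^m + ... + x_k^m."""
    return sum(x ** m for x in xs)

def check(left, right, m: int) -> bool:
    """True if the two tuples have the same sum of m-th powers."""
    return lander(left, m) == lander(right, m)

identities = [
    ((59, 158), (133, 134), 4),                         # Euler
    ((1, 12), (9, 10), 3),                              # taxicab number 1729
    ((2682440, 15365639, 18796760), (20615673,), 4),    # Elkies
    ((30, 120, 272, 315), (353,), 4),                   # Norrie
    ((27, 84, 110, 133), (144,), 5),                    # Lander, Parkin
]
for left, right, m in identities:
    print(left, right, m, check(left, right, m))
```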


5.12 Continuity of random variables 


Let $X$ be a random variable on a probability space $(\Omega, \mathcal{A}, P)$. How can
we see from the characteristic function $\phi_X$ whether $X$ is continuous or
not? If it is continuous, how can we deduce from the characteristic function
whether $X$ is absolutely continuous or not? The first question is completely
answered by Wiener's theorem given below. The decision about singular
or absolute continuity is more subtle. There is a necessary condition for
absolute continuity:


Theorem 5.12.1 (Riemann-Lebesgue lemma). If $X \in L^1$, then $\phi_X(n) \to 0$
for $|n| \to \infty$.


Proof. Given $\epsilon > 0$, choose $n$ so large that the $n$'th Fourier approximation
$X_n(x) = \sum_{k=-n}^{n} \phi_X(k) e^{ikx}$ satisfies $\|X - X_n\|_1 < \epsilon$. For $m > n$, we have
$\phi_{X_n}(m) = E[e^{imX_n}] = 0$ so that


\[ |\phi_X(m)| = |\phi_{X - X_n}(m)| \leq \|X - X_n\|_1 \leq \epsilon . \]
□

Remark. The Riemann-Lebesgue lemma can not be reversed. There are
random variables $X$ for which $\phi_X(n) \to 0$, but for which $X$ is not in $L^1$.

Here is an example of a criterion for the characteristic function which assures
that $X$ is absolutely continuous:


Theorem 5.12.2 (Convexity). If an = a_n satisfies a, — 0 for n — oo and 
Gn+1 — 2an +Qn_1 > 0, then there exists a random variable X € £) for 
which ¢x(n) = an. 


—_—— eo eee 


Proof. We follow [48]. 

(i) bn = @n — Qn41 decreases monotonically. 

Proof: the convexity condition is equivalent to Gn — Qn41 < Gn_1 — Gn. 
(ii) by = Gn — Gn41 is non-negative for all n. 

Proof: b, decreases monotonically. If some b, = c < 0, then by (i), also 
bm < c for all m contradicting the assumption that bn — 0. 

(iii) Also nb, goes to zero. 

Proof: Because }~¢ k=1 (2k ~@k+1) = @1—Gn+41 is bounded and the summands 
are positive, we must have k(a;, — ax41) > 0. 

(iv) ogai R(@e—1 — 20% + @%41) 3 0 for n > 00. 

Proof. This sum simplifies to a9 — @n41 — (Qn — Qn41. By (iiii), it goes to 
0 for n — oo. 

(v ) The random variable Y(x) = 72, k(ax—1 — 2ax + @e41)Ky(zx) is in 
L’, if K,(2) is the Féjer kernel with Fourier coefficients 1 — \9|/(k + 1). 
Proof. The Féjer kernel is a positive summability kernel and satisfies 


1 2a 
ell = 5 [ Ke(e) de = 


for all k. The sum converges by (iv). 
(vi) The random variables X and Y have the same characteristic functions. 
Proof. 


dy (n) = ae k(@p-1 — 2an+ Qn41) Ki (n) 
k=1 


= ce 1 ~ 2ap + Qk41)(1 — ——— al a) 


rs k+1 
= WL 
= y ie 1 — 2ax + Qe41)(1 - ——) = a, . 
k+1 
n+1 


O 


For bounded random variables, the existence of a discrete component of 
the random variable X is decided by the following theorem. It will follow 
from corollary (5.12.5) given later on. 


Sed be Eee ESE a, a ee ee ee er 


Theorem 5.12.3 (Wiener theorem). Given X € £L° with law py supported 
in {—7, 7] and characteristic function ¢ = ¢x. Then 


jim, ales (k)P = SUPIX =a)”. 


zER 


Therefore, X is continuous if and only if the Wiener averages 
4 her |¢x (k)|? converge to 0. 


Lemma 5.12.4. If 4 is a measure on the circle T with Fourier coefficients 
fix, then for every z € T, one has 


u({z}) = lim 


Proof. We follow [48]. The Dirichlet kernel 


Dat) ott = Sule 


k=-n 
satisfies 
Da (2) = Sa(f)(0) = Yo flee. 
k=-n 
The functions 


n 


1 
2n+1 


eine ent 


fn(t) = D,(t — 


are bounded by 1 and go to zero uniformly outside any neighborhood of 
t= az. From 


E+E 
lim [ lala w({2})55)1 = 0 


follows 
dim (fas — w({z})) = 
so that 


je™* — w({x}) > 0 


oe MEW) = 5 
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Theorem 5.12.7 (Strichartz). Let be a uniformly h-continuous measure 
on the circle. There exists a constant C such that for all n 


= le? so-n(4). 


k=1 


Proof. The computation ([102, 103] for the Fourier transform was adapted 
to Fourier series in [51]). In the following computation, we abbreviate du(z) 
with dz: 


n-1 1 n—1 Be 
1 4 € on 7 
=D Wile Si ef s dO |jixl? 
k=-n 0 k=-n 
1 n-1 = k+0)? 
=. € | > camel i e 2k dardydo 
0,--, =” T? 
1 n-1 — E49" _i(a—y)k 
=3 € [ J yo ——_ doderdy 
T?J0 pon n 


1 2 
a e | i eo Sawn st i(e—y)o 
T? JO 


nal 4 (HEE 4i(e—y) $)? 


d@dxdy 
k=-n 
and continue 
1 n—-1 ( yon? 1 
= Al, Ss ef ee | f 
5 Oe al c 
nal ~ (i442 +(2—-y) 3)? 
———_——_———. d6| dxdy 


k=—-n 
00 9—(£+i(e—y) 3)? 2 
moe ff gee acy 
T2 J—oo nr 
=7 e/a [ (e-@-9**) dedy 
T2 


n2 
<s eva | eG de dy)? 
T2 
fo @) 


n2 
iS | HF aed” 
k=0 Y k/nS|2—y|S(k+1)/n 
oo 
<10 eVnCyh(n-)(- e“#/2)1/2 
k=0 


<u Ch(n7!) ‘ 
are tagged as threshold cases by this reasoning.


This argument still has to be made rigorous by showing that the
distribution of the points f(x) mod 1 is uniform enough, which amounts to
understanding a dynamical system with multidimensional time. We see
nevertheless that probabilistic thinking can help to bring order into the
zoo of Diophantine equations. Here are some known solutions, some written
in the Lander notation


$$ x^m = (x_1, \dots, x_k)^m = x_1^m + \cdots + x_k^m . $$


m = 2, k = 2: $x^2 + y^2 = z^2$, Pythagorean triples like $3^2 + 4^2 = 5^2$ (1900 BC).

m = 3, k = 2: $x^m + y^m = z^m$ is impossible, by Fermat's theorem.

m = 3, k = 3: $x^3 + y^3 + u^3 = v^3$, derived from taxicab numbers, like $10^3 + 9^3 = 1^3 + 12^3$ (Viète 1591).

m = 4, k = 3: $2682440^4 + 15365639^4 + 18796760^4 = 20615673^4$ (Elkies 1988 [24]).

m = 5, k = 3: an equation like $x^5 + y^5 + z^5 = w^5$ is open.

m = 4, k = 4: $30^4 + 120^4 + 272^4 + 315^4 = 353^4$ (R. Norrie 1911 [35]).

m = 5, k = 4: $27^5 + 84^5 + 110^5 + 133^5 = 144^5$ (Lander Parkin 1967).

m = 6, k = 5: $x^6 + y^6 + z^6 + u^6 + v^6 = w^6$ is open.

m = 6, k = 7: $(74, 234, 402, 474, 702, 894, 1077)^6 = 1141^6$.

m = 7, k = 7: $(525, 439, 430, 413, 266, 258, 127)^7 = 568^7$ (Mark Dodrill, 1999).

m = 8, k = 8: $(1324, 1190, 1088, 748, 524, 478, 223, 90)^8 = 1409^8$ (Scott Chase).

m = 9, k = 12: $(91, 91, 89, 71, 68, 65, 43, 42, 19, 16, 13, 5)^9 = 103^9$ (Jean-Charles Meyrignac, 1997).
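
The numerical identities in this list can be checked directly with exact
integer arithmetic. The following is a minimal sketch; the names EXAMPLES,
power_sum and lander_holds are chosen here for illustration and do not come
from the text. Each entry stores the exponent m and the terms on the two
sides of the equation.

```python
# Each entry: (exponent m, terms on the left side, terms on the right side).
# The numbers are the examples quoted in the list above.
EXAMPLES = [
    (2, [3, 4], [5]),
    (3, [10, 9], [1, 12]),
    (4, [2682440, 15365639, 18796760], [20615673]),
    (4, [30, 120, 272, 315], [353]),
    (5, [27, 84, 110, 133], [144]),
    (6, [74, 234, 402, 474, 702, 894, 1077], [1141]),
    (7, [525, 439, 430, 413, 266, 258, 127], [568]),
    (8, [1324, 1190, 1088, 748, 524, 478, 223, 90], [1409]),
    (9, [91, 91, 89, 71, 68, 65, 43, 42, 19, 16, 13, 5], [103]),
]


def power_sum(terms, m):
    """Sum of the m-th powers of the given integers (exact, no rounding)."""
    return sum(t ** m for t in terms)


def lander_holds(m, left, right):
    """True if the m-th power sums of both sides agree."""
    return power_sum(left, m) == power_sum(right, m)


# Print exponent, number of terms on the left, and whether the identity holds.
for m, left, right in EXAMPLES:
    print(m, len(left), lander_holds(m, left, right))
```

Python integers have arbitrary precision, so no overflow or rounding can
affect the comparison.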


5.12 Continuity of random variables 


Let $X$ be a random variable on a probability space $(\Omega, \mathcal{A}, \mathrm{P})$.
How can we see from the characteristic function $\phi_X$ whether $X$ is
continuous or not? If it is continuous, how can we deduce from the
characteristic function whether $X$ is absolutely continuous or not? The
first question is completely answered by Wiener's theorem given below. The
decision about singular or absolute continuity is more subtle. There is a
necessary condition for absolute continuity:


Theorem 5.12.1 (Riemann-Lebesgue lemma). If $X \in L^1$, then $\phi_X(n) \to 0$
for $|n| \to \infty$.


Proof. Given $\epsilon > 0$, choose $n$ so large that the $n$'th Fourier
approximation $X_n(x) = \sum_{k=-n}^{n} \phi_X(k) e^{ikx}$ satisfies
$\|X - X_n\|_1 < \epsilon$. For $m > n$, we have
$\phi_{X_n}(m) = \mathrm{E}[e^{imX_n}] = 0$ so that


$$ |\phi_X(m)| = |\phi_{X - X_n}(m)| \leq \|X - X_n\|_1 \leq \epsilon . $$
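
As a numerical illustration of this decay, consider a random variable which
is uniformly distributed on [0, 1], so that its law has a density. This
example and the function names below are assumptions made for illustration.
The sketch compares the closed form $(e^{in} - 1)/(in)$ of the characteristic
function with a midpoint Riemann sum and with the bound $2/n$.

```python
import cmath


def phi_uniform(n):
    """Closed form of E[exp(i n X)] for X uniform on [0, 1]."""
    if n == 0:
        return 1.0 + 0.0j
    return (cmath.exp(1j * n) - 1.0) / (1j * n)


def phi_midpoint(n, steps=20000):
    """The same expectation as a midpoint Riemann sum over the density 1 on [0, 1]."""
    h = 1.0 / steps
    return sum(cmath.exp(1j * n * (j + 0.5) * h) for j in range(steps)) * h


# |phi(n)| is bounded by 2/n and therefore tends to zero.
for n in (1, 10, 100, 1000):
    print(n, abs(phi_uniform(n)), 2.0 / n)
```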
Remark. The Riemann-Lebesgue lemma can not be reversed. There are
random variables $X$ for which $\phi_X(n) \to 0$, but for which $X$ is not
in $L^1$. Here is an example of a criterion for the characteristic function
which assures that $X$ is absolutely continuous:


Theorem 5.12.2 (Convexity). If $a_n = a_{-n}$ satisfies $a_n \to 0$ for
$n \to \infty$ and $a_{n+1} - 2a_n + a_{n-1} \geq 0$, then there exists a
random variable $X \in L^1$ for which $\phi_X(n) = a_n$.


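Proof. We follow [48].

(i) $b_n = a_n - a_{n+1}$ decreases monotonically.
Proof: the convexity condition is equivalent to
$a_n - a_{n+1} \leq a_{n-1} - a_n$.

(ii) $b_n = a_n - a_{n+1}$ is non-negative for all $n$.
Proof: $b_n$ decreases monotonically. If some $b_n = c < 0$, then by (i),
also $b_m \leq c$ for all $m \geq n$, contradicting $b_n \to 0$, which
follows from the assumption $a_n \to 0$.

(iii) Also $n b_n$ goes to zero.
Proof: Because $\sum_{k=1}^{n} (a_k - a_{k+1}) = a_1 - a_{n+1}$ is bounded
and the summands are non-negative and decreasing, we must have
$k(a_k - a_{k+1}) \to 0$.

(iv) $\sum_{k=1}^{n} k(a_{k-1} - 2a_k + a_{k+1}) \to a_0$ for $n \to \infty$.
Proof. This sum simplifies to $a_0 - a_n - n(a_n - a_{n+1})$. By (iii) and
$a_n \to 0$, it goes to $a_0$ for $n \to \infty$.

(v) The random variable
$Y(x) = \sum_{k=1}^{\infty} k(a_{k-1} - 2a_k + a_{k+1}) K_{k-1}(x)$ is in
$L^1$, if $K_k(x)$ is the Fejér kernel with Fourier coefficients
$1 - |j|/(k+1)$ for $|j| \leq k$.
Proof. The Fejér kernel is a positive summability kernel and satisfies


$$ \|K_k\|_1 = \frac{1}{2\pi} \int_0^{2\pi} K_k(x) \, dx = 1 $$


for all $k$. The sum converges by (iv).

(vi) The random variable $Y$ has the characteristic function
$\phi_Y(n) = a_n$, so that we can take $X = Y$.
Proof. Since $\hat{K}_{k-1}(n) = 1 - |n|/k$ for $k > |n|$ and
$\hat{K}_{k-1}(n) = 0$ otherwise, we get


$$
\begin{aligned}
\phi_Y(n) &= \sum_{k=1}^{\infty} k(a_{k-1} - 2a_k + a_{k+1}) \hat{K}_{k-1}(n) \\
          &= \sum_{k=|n|+1}^{\infty} k(a_{k-1} - 2a_k + a_{k+1}) \Big(1 - \frac{|n|}{k}\Big) \\
          &= \sum_{k=|n|+1}^{\infty} (k - |n|)(a_{k-1} - 2a_k + a_{k+1}) = a_n ,
\end{aligned}
$$


where the last sum telescopes as in (iv).

□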
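
The mechanism of this proof can be tested numerically. The sketch below
assumes the example sequence $a_n = 1/(1+|n|)$, which is even, convex for
$n \geq 0$ and tends to zero; this sequence is not taken from the text. It
also assumes that the $k$-th term of the series carries the Fejér weight
$\max(0, 1 - |n|/k)$, the normalization under which the weighted sum of
second differences telescopes to $a_n$. The function names are hypothetical.

```python
def a(n):
    """Example sequence (an assumption for illustration): even, convex, tends to 0."""
    return 1.0 / (1.0 + abs(n))


def second_difference(k):
    """a_{k-1} - 2 a_k + a_{k+1}, non-negative by the convexity assumption."""
    return a(k - 1) - 2.0 * a(k) + a(k + 1)


def reconstructed(n, cutoff=20000):
    """Truncated series sum_{k > |n|} (k - |n|) (a_{k-1} - 2 a_k + a_{k+1})."""
    n = abs(n)
    return sum((k - n) * second_difference(k) for k in range(n + 1, cutoff + 1))


# The truncated series should be close to a_n; the truncation error is O(1/cutoff).
for n in range(5):
    print(n, a(n), reconstructed(n))
```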


For bounded random variables, the existence of a discrete component of 
the random variable X is decided by the following theorem. It will follow 
from corollary (5.12.5) given later on. 
Theorem 5.12.3 (Wiener theorem). Given $X \in L^\infty$ with law $\mu$
supported in $[-\pi, \pi]$ and characteristic function $\phi = \phi_X$. Then


$$ \lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^{n} |\phi_X(k)|^2 = \sum_{x \in \mathbb{R}} \mathrm{P}[X = x]^2 . $$


Therefore, $X$ is continuous if and only if the Wiener averages
$\frac{1}{n} \sum_{k=1}^{n} |\phi_X(k)|^2$ converge to 0.
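
The following sketch illustrates the theorem numerically with two examples
chosen here for illustration only: a discrete random variable with atoms at
1, -2 and 0.5 of weights 0.5, 0.3 and 0.2, for which the Wiener averages
should approach $0.5^2 + 0.3^2 + 0.2^2 = 0.38$, and the uniform distribution
on [-1, 1], for which they should approach 0. The names ATOMS,
phi_discrete, phi_uniform and wiener_average are hypothetical.

```python
import cmath
import math

# Hypothetical discrete law: value -> probability, all values inside [-pi, pi].
ATOMS = {1.0: 0.5, -2.0: 0.3, 0.5: 0.2}


def phi_discrete(k):
    """Characteristic function E[exp(i k X)] of the discrete law ATOMS."""
    return sum(p * cmath.exp(1j * k * x) for x, p in ATOMS.items())


def phi_uniform(k):
    """Characteristic function of the uniform distribution on [-1, 1]."""
    return math.sin(k) / k if k != 0 else 1.0


def wiener_average(phi, n):
    """(1/n) * sum_{k=1}^{n} |phi(k)|^2."""
    return sum(abs(phi(k)) ** 2 for k in range(1, n + 1)) / n


for n in (10, 100, 1000, 5000):
    print(n, wiener_average(phi_discrete, n), wiener_average(phi_uniform, n))
```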


Lemma 5.12.4. If $\mu$ is a measure on the circle $T$ with Fourier
coefficients $\hat{\mu}_k$, then for every $x \in T$, one has


$$ \mu(\{x\}) = \lim_{n \to \infty} \frac{1}{2n+1} \sum_{k=-n}^{n} \hat{\mu}_k e^{ikx} . $$


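Proof. We follow [48]. The Dirichlet kernel


$$ D_n(t) = \sum_{k=-n}^{n} e^{ikt} = \frac{\sin((n + 1/2)t)}{\sin(t/2)} $$


satisfies


$$ D_n \star f(x) = S_n(f)(x) = \sum_{k=-n}^{n} \hat{f}(k) e^{ikx} . $$


The functions


$$ f_n(t) = \frac{1}{2n+1} D_n(t - x) = \frac{1}{2n+1} \sum_{k=-n}^{n} e^{-ikx} e^{ikt} $$


are bounded by 1 and go to zero uniformly outside any neighborhood of
$t = x$. From


$$ \lim_{\epsilon \to 0} \int_{x-\epsilon}^{x+\epsilon} |d(\mu - \mu(\{x\}) \delta_x)| = 0 $$


follows


$$ \lim_{n \to \infty} \big( \langle f_n, \mu \rangle - \mu(\{x\}) \big) = 0 $$


so that


$$ \langle f_n, \mu \rangle - \mu(\{x\}) = \frac{1}{2n+1} \sum_{k=-n}^{n} \hat{\mu}_k e^{ikx} - \mu(\{x\}) \to 0 . $$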
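
A small numerical experiment shows how these averages pick out a point mass.
The sketch assumes the hypothetical measure $0.4\,\delta_1 + 0.6\,\lambda$,
where $\lambda$ is the normalized Lebesgue measure on the circle, and the
convention $\hat{\mu}_k = \int e^{-ikt} \, d\mu(t)$ for the Fourier
coefficients. Both are choices made for this illustration.

```python
import cmath

MASS, ATOM = 0.4, 1.0  # hypothetical example: 0.4 * delta_1 + 0.6 * uniform


def mu_hat(k):
    """Fourier coefficient int exp(-i k t) d mu(t) of the example measure."""
    uniform_part = 0.6 if k == 0 else 0.0
    return MASS * cmath.exp(-1j * k * ATOM) + uniform_part


def point_mass_estimate(x, n):
    """(1/(2n+1)) * sum_{k=-n}^{n} mu_hat(k) exp(i k x), real part."""
    total = sum(mu_hat(k) * cmath.exp(1j * k * x) for k in range(-n, n + 1))
    return (total / (2 * n + 1)).real


# At the atom the averages approach 0.4, away from it they approach 0.
for n in (10, 100, 1000):
    print(n, point_mass_estimate(1.0, n), point_mass_estimate(2.0, n))
```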
Definition. If $\mu$ and $\nu$ are two measures on $(\Omega = T, \mathcal{A})$,
then their convolution is defined as


$$ \mu \star \nu(A) = \int_T \mu(A - x) \, d\nu(x) $$


for any $A \in \mathcal{A}$. Define for a measure $\mu$ on $[-\pi, \pi]$ also
$\mu^*(A) = \overline{\mu(-A)}$.


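Remark. We have $\hat{\mu^*}(n) = \overline{\hat{\mu}(n)}$ and
$\widehat{\mu \star \nu}(n) = \hat{\mu}(n) \hat{\nu}(n)$. If
$\mu = \sum_j a_j \delta_{x_j}$ is a discrete measure, then
$\mu^* = \sum_j \overline{a_j} \delta_{-x_j}$. Because
$\mu \star \mu^*(\{0\}) = \sum_j |a_j|^2$, we have in general


$$ (\mu \star \mu^*)(\{0\}) = \sum_{x \in T} |\mu(\{x\})|^2 . $$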
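
For a discrete measure, this identity can be checked with exact arithmetic.
The sketch below models the circle by the cyclic group $\mathbb{Z}_{12}$ so
that points can be stored as integers. This discretization, the weights and
the names convolve and star are assumptions made for illustration.

```python
from fractions import Fraction

N = 12  # model the circle by the cyclic group Z_12 (an assumption for exactness)


def convolve(mu, nu):
    """Convolution of two discrete measures given as {point: weight} dictionaries."""
    out = {}
    for x, a in mu.items():
        for y, b in nu.items():
            z = (x + y) % N
            out[z] = out.get(z, 0) + a * b
    return out


def star(mu):
    """mu^*(A) = conj(mu(-A)); the weights here are real, so only points are reflected."""
    return {(-x) % N: a for x, a in mu.items()}


mu = {0: Fraction(1, 2), 3: Fraction(3, 10), 7: Fraction(1, 5)}
autocorrelation = convolve(mu, star(mu))

# The mass at 0 of mu * mu^* equals the sum of the squared point masses.
print(autocorrelation[0], sum(a * a for a in mu.values()))
```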


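Corollary 5.12.5. (Wiener) $\sum_{x \in T} |\mu(\{x\})|^2 = \lim_{n \to \infty} \frac{1}{2n+1} \sum_{k=-n}^{n} |\hat{\mu}_k|^2$.

Proof. Apply Lemma 5.12.4 at $x = 0$ to the measure $\mu \star \mu^*$, which
has the Fourier coefficients $|\hat{\mu}_k|^2$ and satisfies
$(\mu \star \mu^*)(\{0\}) = \sum_{x \in T} |\mu(\{x\})|^2$. □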


Remark. For bounded random variables, we can rescale the random variable
so that its values are in $[-\pi, \pi]$ and so that we can use Fourier
series instead of Fourier integrals. We have also


$$ \sum_{x \in \mathbb{R}} |\mu(\{x\})|^2 = \lim_{R \to \infty} \frac{1}{2R} \int_{-R}^{R} |\hat{\mu}(t)|^2 \, dt . $$

We turn our attention now to random variables with singular continuous
distribution. For these random variables, one does have $\mathrm{P}[X = c] = 0$
for all $c$. Furthermore, the distribution function $F_X$ of such a random
variable $X$ does not have a density. The graph of $F_X$ looks like a
devil's staircase. Here is a refinement of the notion of continuity for
measures.


Definition. Given a function $h: \mathbb{R} \to [0, \infty)$ satisfying
$\lim_{x \to 0} h(x) = 0$. A measure $\mu$ on the real line or on the circle
is called uniformly $h$-continuous, if there exists a constant $C$ such that
for all intervals $I = [a, b]$ on $T$ the inequality


$$ \mu(I) \leq C h(|I|) $$


holds, where $|I| = b - a$ is the length of $I$. For $h(x) = x^\alpha$ with
$0 < \alpha < 1$, the measure is called uniformly $\alpha$-continuous. It is
then the derivative of an $\alpha$-Hölder continuous function.


Remark. If $\mu$ is the law of a singular continuous random variable $X$
with distribution function $F_X$, then $F_X$ is $\alpha$-Hölder continuous if
and only if $\mu$ is $\alpha$-continuous. For general $h$, one calls $F$
uniformly lip-$h$ continuous [86].
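
A standard example, not taken from the text, is the Cantor measure on
[0, 1]: its distribution function is the Cantor function, and the natural
exponent is $\alpha = \log 2/\log 3$. The sketch below evaluates the Cantor
function from the ternary digits of $x$ and computes the largest ratio
$\mu(I)/|I|^\alpha$ over intervals with endpoints on a grid of step 1/81.
The grid and all names are assumptions for illustration; the ratio should
stay bounded, as uniform $\alpha$-continuity predicts.

```python
from fractions import Fraction
import math

ALPHA = math.log(2) / math.log(3)


def cantor_function(x, depth=40):
    """Distribution function of the Cantor measure, read off the ternary digits of x."""
    if x <= 0:
        return 0.0
    if x >= 1:
        return 1.0
    value, scale = 0.0, 0.5
    for _ in range(depth):
        x *= 3
        digit = int(x)       # next ternary digit (0, 1 or 2)
        x -= digit
        if digit == 1:       # x is in the closure of a removed middle third
            return value + scale
        value += scale * (digit // 2)
        scale /= 2
    return value


GRID = [Fraction(i, 81) for i in range(82)]
F = [cantor_function(x) for x in GRID]


def worst_ratio():
    """Largest mu(I) / |I|^alpha over intervals I with endpoints on the grid."""
    return max((F[j] - F[i]) / float(GRID[j] - GRID[i]) ** ALPHA
               for i in range(82) for j in range(i + 1, 82))


print(worst_ratio())
```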


Theorem 5.12.6 (Y. Last). If there exists $C$, such that
$\frac{1}{n} \sum_{k=1}^{n} |\hat{\mu}_k|^2 < C \cdot h(\frac{1}{n})$ for all
$n > 0$, then $\mu$ is uniformly $\sqrt{h}$-continuous.
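Proof. We follow [56]. The Dirichlet kernel satisfies


$$ \sum_{k=-n}^{n} |\hat{\mu}_k|^2 = \int\!\!\int_{T^2} D_n(y - x) \, d\mu(x) \, d\mu(y) $$


and the Fejér kernel $K_n(t)$ satisfies


$$
\begin{aligned}
K_n(t) &= \frac{1}{n+1} \left( \frac{\sin(\frac{n+1}{2} t)}{\sin(t/2)} \right)^2 \\
       &= \sum_{k=-n}^{n} \Big( 1 - \frac{|k|}{n+1} \Big) e^{ikt} \\
       &= D_n(t) - \sum_{k=-n}^{n} \frac{|k|}{n+1} e^{ikt} .
\end{aligned}
$$


Therefore


$$
\begin{aligned}
0 \leq \frac{1}{n+1} \sum_{k=-n}^{n} |k| \, |\hat{\mu}_k|^2
  &= \int\!\!\int_{T^2} \big( D_n(y - x) - K_n(y - x) \big) \, d\mu(x) \, d\mu(y) \\
  &= \sum_{k=-n}^{n} |\hat{\mu}_k|^2 - \int\!\!\int_{T^2} K_n(y - x) \, d\mu(x) \, d\mu(y) . \qquad (5.4)
\end{aligned}
$$


Because $\hat{\mu}_{-n} = \overline{\hat{\mu}_n}$, we can also sum from $-n$
to $n$, changing only the constant $C$. If $\mu$ is not uniformly
$\sqrt{h}$-continuous, there exists a sequence of intervals $|I_l| \to 0$
with $\mu(I_l) \geq l \sqrt{h(|I_l|)}$. A property of the Fejér kernel
$K_n(t)$ is that for large enough $n$, there exists $\delta > 0$ such that
$\frac{1}{n} K_n(t) \geq \delta > 0$ if $1 \leq n|t| \leq \pi/2$. Choose
$n_l$ so that $1 \leq n_l \cdot |I_l| \leq \pi/2$. Using estimate (5.4), one
gets


$$
\begin{aligned}
\sum_{k=-n_l}^{n_l} \frac{|\hat{\mu}_k|^2}{n_l}
  &\geq \int\!\!\int_{T^2} \frac{K_{n_l}(y - x)}{n_l} \, d\mu(x) \, d\mu(y) \\
  &\geq \delta \mu(I_l)^2 \geq \delta l^2 h(|I_l|) \\
  &\geq C \cdot h\Big(\frac{1}{n_l}\Big) .
\end{aligned}
$$


This contradicts the existence of $C$ such that


$$ \frac{1}{n} \sum_{k=-n}^{n} |\hat{\mu}_k|^2 \leq C h\Big(\frac{1}{n}\Big) . $$

□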
Theorem 5.12.7 (Strichartz). Let $\mu$ be a uniformly $h$-continuous measure
on the circle. There exists a constant $C$ such that for all $n$


$$ \frac{1}{n} \sum_{k=1}^{n} |\hat{\mu}_k|^2 \leq C h\Big(\frac{1}{n}\Big) . $$
k=1 ; 


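Proof. The computation ([102, 103] for the Fourier transform was adapted
to Fourier series in [51]). In the following computation, we abbreviate
$d\mu(x)$ with $dx$:


$$
\begin{aligned}
\frac{1}{n} \sum_{k=-n}^{n-1} |\hat{\mu}_k|^2
 &\overset{(1)}{\leq} e \sum_{k=-n}^{n-1} \int_0^1 \frac{e^{-(k+\theta)^2/n^2}}{n} \, d\theta \; |\hat{\mu}_k|^2 \\
 &\overset{(2)}{=} e \int_0^1 \sum_{k=-n}^{n-1} \frac{e^{-(k+\theta)^2/n^2}}{n} \int_{T^2} e^{-i(x-y)k} \, dx \, dy \, d\theta \\
 &\overset{(3)}{=} e \int_{T^2} \int_0^1 \sum_{k=-n}^{n-1} \frac{e^{-(k+\theta)^2/n^2 - i(x-y)k}}{n} \, d\theta \, dx \, dy \\
 &\overset{(4)}{=} e \int_{T^2} \int_0^1 e^{-(x-y)^2 n^2/4 + i(x-y)\theta} \sum_{k=-n}^{n-1} \frac{e^{-\left( \frac{k+\theta}{n} + i(x-y)\frac{n}{2} \right)^2}}{n} \, d\theta \, dx \, dy \\
 &\overset{(5)}{\leq} e \int_{T^2} e^{-(x-y)^2 n^2/4} \left| \int_0^1 \sum_{k=-n}^{n-1} \frac{e^{-\left( \frac{k+\theta}{n} + i(x-y)\frac{n}{2} \right)^2}}{n} \, d\theta \right| dx \, dy \\
 &\overset{(6)}{=} e \int_{T^2} \left[ \int_{-\infty}^{\infty} \frac{e^{-\left( \frac{t}{n} + i(x-y)\frac{n}{2} \right)^2}}{n} \, dt \right] e^{-(x-y)^2 n^2/4} \, dx \, dy \\
 &\overset{(7)}{=} e \sqrt{\pi} \int_{T^2} e^{-(x-y)^2 n^2/4} \, dx \, dy \\
 &\overset{(8)}{\leq} e \sqrt{\pi} \left( \int_{T^2} e^{-(x-y)^2 n^2/2} \, dx \, dy \right)^{1/2} \\
 &\overset{(9)}{=} e \sqrt{\pi} \left( \sum_{k=0}^{\infty} \int_{k/n \leq |x-y| \leq (k+1)/n} e^{-(x-y)^2 n^2/2} \, dx \, dy \right)^{1/2} \\
 &\overset{(10)}{\leq} e \sqrt{\pi} \, C_1 h(n^{-1}) \left( \sum_{k=0}^{\infty} e^{-k^2/2} \right)^{1/2} \\
 &\overset{(11)}{\leq} C h(n^{-1}) .
\end{aligned}
$$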
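Here are some remarks about the steps done in this computation:

(1) is the trivial estimate


$$ e \int_0^1 \sum_{k=-n}^{n-1} \frac{e^{-(k+\theta)^2/n^2}}{n} \, d\theta \geq 1 , $$


which holds because $e \cdot e^{-(k+\theta)^2/n^2} \geq 1$ for
$-n \leq k \leq n-1$ and $0 \leq \theta \leq 1$.

(2) $\int_{T^2} e^{-i(x-y)k} \, d\mu(x) \, d\mu(y) = \int_T e^{-ixk} \, d\mu(x) \int_T e^{iyk} \, d\mu(y) = \hat{\mu}_k \overline{\hat{\mu}_k} = |\hat{\mu}_k|^2$.

(3) uses Fubini's theorem.

(4) is a completion of the square.

(5) is the Cauchy-Schwarz inequality.

(6) replaces a sum and the integral $\int_0^1$ by $\int_{-\infty}^{\infty}$.

(7) uses $\int_{-\infty}^{\infty} e^{-t^2} \, dt = \sqrt{\pi}$ because


$$ \int_{-\infty}^{\infty} \frac{e^{-(t/n + \beta)^2}}{n} \, dt = \int_{-\infty}^{\infty} e^{-t^2} \, dt $$


for all $n$ and complex $\beta$.

(8) is Jensen's inequality.

(9) splits the integral into a sum of integrals over strips of width $1/n$.

(10) uses the assumption that $\mu$ is $h$-continuous.

(11) This step uses that


$$ \left( \sum_{k=0}^{\infty} e^{-k^2/2} \right)^{1/2} $$


is a constant. □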
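
The statement of the theorem can be explored numerically for the Cantor
measure, used here as an illustrative example which is not part of the
text. The sketch assumes that the Cantor measure on [0, 1] is wrapped once
around the circle, that $|\hat{\mu}_k|^2 = \prod_{j \geq 1} \cos^2(2\pi k/3^j)$
(truncated after 40 factors), and that $h(x) = x^\alpha$ with
$\alpha = \log 2/\log 3$. The printed ratios of the averages to $h(1/n)$
should stay bounded.

```python
import math

ALPHA = math.log(2) / math.log(3)  # assumed Hölder exponent of the Cantor function


def cantor_coefficient_sq(k, factors=40):
    """|mu_hat_k|^2 for the Cantor measure on [0, 1] wrapped once around the circle."""
    value = 1.0
    for j in range(1, factors + 1):
        value *= math.cos(2.0 * math.pi * k / 3.0 ** j) ** 2
    return value


def scaled_average(n):
    """(1/n) * sum_{k=1}^{n} |mu_hat_k|^2, divided by h(1/n) = n^(-ALPHA)."""
    average = sum(cantor_coefficient_sq(k) for k in range(1, n + 1)) / n
    return average * n ** ALPHA


for n in (3, 9, 27, 81, 243, 729):
    print(n, scaled_average(n))
```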